Discussion: Adding basic NUMA awareness
Hi,

This is a WIP version of a patch series I'm working on, adding some basic NUMA awareness for a couple of parts of our shared memory (shared buffers, etc.). It's based on Andres' experimental patches he spoke about at pgconf.eu 2024 [1], and while it's improved and polished in various ways, it's still experimental. But there's a recent thread aiming to do something similar [2], so better to share it now so that we can discuss both approaches. This patch set is a bit more ambitious, handling NUMA in a way to allow smarter optimizations later, so I'm posting it in a separate thread.

The series is split into patches addressing different parts of the shared memory, starting (unsurprisingly) from shared buffers, then buffer freelists and ProcArray. There are a couple of additional parts, but those are smaller / address miscellaneous stuff.

Each patch has a numa_ GUC, intended to enable/disable that part. This is meant to make development easier, not as a final interface. I'm not sure how exactly that should look. It's possible some combinations of GUCs won't work, etc.

Each patch should have a commit message explaining the intent and implementation, and then also detailed comments explaining various challenges and open questions. But let me go over the basics, and discuss some of the design choices and open questions that need solving.

1) v1-0001-NUMA-interleaving-buffers.patch

This is the main thing when people think about NUMA - making sure the shared buffers are allocated evenly on all the nodes, not just on a single node (which can happen easily with warmup). Regular memory interleaving would address this, but it also has some disadvantages.

Firstly, it's oblivious to the contents of the shared memory segment, and we may not want to interleave everything. It's also oblivious to the alignment of the items (a buffer can easily end up "split" across multiple NUMA nodes), or the relationship between different parts (e.g.
there's a BufferBlock and a related BufferDescriptor, and those might again end up on different nodes).

So the patch handles this by explicitly mapping chunks of shared buffers to different nodes - a bit like interleaving, but in larger chunks. Ideally each node gets (1/N) of shared buffers, as a contiguous chunk.

It's a bit more complicated, because the patch distributes both the blocks and descriptors, in the same way. So a buffer and its descriptor always end up on the same NUMA node. This is one of the reasons why we need to map larger chunks, because NUMA works on page granularity, and the descriptors are tiny - many fit on a memory page.

There's a secondary benefit of explicitly assigning buffers to nodes using this simple scheme - it allows quickly determining the node ID given a buffer ID. This is helpful later, when building the freelists.

The patch is fairly simple. Most of the complexity is about picking the chunk size and aligning the arrays (so that everything nicely aligns with memory pages).

The patch has a GUC "numa_buffers_interleave", with "off" by default.

2) v1-0002-NUMA-localalloc.patch

This simply sets "localalloc" when initializing a backend, so that all memory allocated later is local, not interleaved. Initially this was necessary because the patch set the allocation policy to interleaving before initializing shared memory, and we didn't want to interleave the private memory. But that's no longer the case - the explicit mapping to nodes does not have this issue. I'm keeping the patch for convenience; it allows experimenting with numactl etc.

The patch has a GUC "numa_localalloc", with "off" by default.

3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch

Minor optimization. Andres noticed we're tracking the tail of the buffer freelist without using it. So the patch removes that.

4) v1-0004-NUMA-partition-buffer-freelist.patch

Right now we have a single freelist, and in busy instances that can be quite contended.
What's worse, the freelist may thrash between different CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on subsets of buffers. The patch implements multiple strategies for how the list can be split (configured using the "numa_partition_freelist" GUC), for experimenting:

* node - One list per NUMA node. This is the most natural option, because we now know which buffer is on which node, so we can ensure a list for a node only has buffers from that node.

* cpu - One list per CPU. Pretty simple, each CPU gets its own list.

* pid - Similar to "cpu", but the processes are mapped to lists based on PID, not CPU ID.

* none - nothing, a single freelist

Ultimately, I think we'll want to go with "node", simply because it aligns with the buffer interleaving. But there are improvements needed.

The main challenge is that with multiple smaller lists, a process can't really use the whole shared buffers. So a single backend will only use part of the memory. The more lists there are, the worse this effect is. This is also why I think we won't use the other partitioning options, because there are going to be more CPUs than NUMA nodes.

Obviously, this needs solving even with NUMA nodes - we need to allow a single backend to utilize the whole shared buffers if needed. There should be a way to "steal" buffers from other freelists (if the "regular" freelist is empty), but the patch does not implement this. Shouldn't be hard, I think.

The other missing part is the clocksweep - there's still just a single instance of the clocksweep, feeding buffers to all the freelists. But that's clearly a problem, because the clocksweep returns buffers from all NUMA nodes. The clocksweep really needs to be partitioned the same way as the freelists, with each partition operating on a subset of buffers (from the right NUMA node). I do have a separate experimental patch doing something like that; I need to make it part of this branch.
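To illustrate the chunked mapping described above (a simplified sketch with made-up names, not the patch's actual code), assigning buffers to nodes in contiguous 1/N slices makes "which node owns this buffer" a trivial computation - which is exactly what the per-node freelists rely on:

```c
#include <assert.h>

/*
 * Hypothetical sketch of the chunked buffer-to-node mapping. The real
 * patch additionally aligns chunks to memory-page boundaries so that
 * buffer descriptors (many per page) land on the same node as their
 * buffers; that detail is omitted here.
 */
static int
buffer_to_node(int buf_id, int nbuffers, int nnodes)
{
    /* each node gets a contiguous ~1/N slice of shared buffers */
    int chunk = (nbuffers + nnodes - 1) / nnodes;

    return buf_id / chunk;
}
```

With the "node" freelist strategy, a backend on node K would then pull only buffers whose buffer_to_node() result is K, falling back to "stealing" from other lists once its own list is empty.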
5) v1-0005-NUMA-interleave-PGPROC-entries.patch

Another area that seems like it might benefit from NUMA is PGPROC, so I gave it a try. It turned out somewhat challenging. Similarly to buffers, we have two pieces that need to be located in a coordinated way - PGPROC entries and fast-path arrays. But we can't use the same approach as for buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is ~900B, fast-path arrays are variable length).

(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require a rather high max_connections before we use multiple huge pages.

The fast-path arrays are less of a problem, because those tend to be larger, and are accessed through pointers, so we can just adjust that.

So what I did instead is split the whole PGPROC array into one array per NUMA node, plus one array for auxiliary processes and 2PC xacts. So with 4 NUMA nodes there are 5 separate arrays, for example. Each array is a multiple of memory pages, so we may waste some of the memory. But that's simply how NUMA works - page granularity.

This however makes one particular thing harder - in a couple of places we accessed PGPROC entries through PROC_HDR->allProcs, which was pretty much just one large array. And GetNumberFromPGProc() relied on array arithmetic to determine the procnumber. With the array partitioned, this can't work the same way.

But there's a simple solution - if we turn allProcs into an array of *pointers* to PGPROC arrays, there's no issue. All the places need a pointer anyway. And then we need an explicit procnumber field in PGPROC, instead of calculating it.

There's a chance this has a negative impact on code that accesses PGPROC very often, but so far I haven't seen such cases. But if you can come up with such examples, I'd like to see those.

There's another detail - when obtaining a PGPROC entry in InitProcess(), we try to get an entry from the same NUMA node.
And only if that doesn't work, we grab the first one from the list (there's still just one PGPROC freelist, I haven't split that - maybe we should?).

This has a GUC "numa_procs_interleave", again "off" by default. It's not quite correct, though, because the partitioning happens always. It only affects the PGPROC lookup. (In a way, this may be a bit broken.)

6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch

This is an experimental patch that simply pins the new process to the NUMA node obtained from the freelist. Driven by GUC "numa_procs_pin" (default: off).

Summary
-------

So this is what I have at the moment. I've tried to organize the patches in order of importance, but that's just my guess. It's entirely possible there's something I missed, some other order might make more sense, etc.

There's also the question of how this is related to other patches affecting shared memory - I think the most relevant one is the "shared buffers online resize" by Ashutosh, simply because it touches the shared memory.

I do think the splitting would actually make some things simpler, or maybe more flexible - in particular, it'd allow us to enable huge pages only for some regions (like shared buffers), and keep the small pages e.g. for PGPROC. So that'd be good.

But there'd also need to be some logic to "rework" how shared buffers get mapped to NUMA nodes after resizing. It'd be silly to start with memory on 4 nodes (25% each), resize shared buffers to 50% and end up with memory only on 2 of the nodes (because the other 2 nodes were originally assigned the upper half of shared buffers).

I don't have a clear idea how this would be done, but I guess it'd require a bit of code invoked sometime after the resize. It'd already need to rebuild the freelists in some way, I guess.

The other thing I haven't thought about very much is determining on which CPUs/nodes the instance is allowed to run.
I assume we'd start by simply inheriting/determining that at startup through libnuma, not through some custom PG configuration (which the patch [2] proposed to do).

regards

[1] https://www.youtube.com/watch?v=V75KpACdl6E
[2] https://www.postgresql.org/message-id/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw%40mail.gmail.com

--
Tomas Vondra
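As a side note, the allProcs change from patch 0005 can be sketched roughly like this (illustrative names only, not the actual patch code): allProcs becomes an array of pointers into separate per-node arrays, and procnumber becomes an explicit field instead of being derived via array arithmetic in GetNumberFromPGProc():

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for PGPROC; the real struct is ~900 bytes. */
typedef struct ProcSketch
{
    int procnumber;  /* explicit, replaces pointer-arithmetic lookup */
    int numa_node;   /* which per-node array holds this entry */
} ProcSketch;

/*
 * Build allProcs as an array of pointers spanning nnodes per-node
 * arrays. (The real patch has one more array for auxiliary processes
 * and 2PC transactions, omitted here for brevity.)
 */
static ProcSketch **
build_allprocs(ProcSketch *node_arrays[], int per_node, int nnodes)
{
    ProcSketch **allProcs = malloc(sizeof(ProcSketch *) * per_node * nnodes);
    int i = 0;

    for (int n = 0; n < nnodes; n++)
    {
        for (int p = 0; p < per_node; p++)
        {
            node_arrays[n][p].procnumber = i;
            node_arrays[n][p].numa_node = n;
            allProcs[i++] = &node_arrays[n][p];
        }
    }
    return allProcs;
}
```

All the call sites need a pointer anyway, so indexing allProcs[procnumber] still works; only the reverse direction (entry to number) now reads the stored field.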
On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
>
> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
>
> Minor optimization. Andres noticed we're tracking the tail of buffer
> freelist, without using it. So the patch removes that.

The patches for resizing buffers use lastFreeBuffer to add new buffers to the end of the free list when expanding it. But we could just as well add them at the beginning of the free list.

This patch seems almost independent of the rest of the patches. Do you need it in the rest of the patches? I understand that those patches don't need to worry about maintaining lastFreeBuffer after this patch. Is there any other effect? If we are going to do this, let's do it earlier so that the buffer resizing patches can be adjusted.

> There's also the question how this is related to other patches affecting
> shared memory - I think the most relevant one is the "shared buffers
> online resize" by Ashutosh, simply because it touches the shared memory.

I have added Dmitry to this thread since he has written most of the shared memory handling code.

> I do think the splitting would actually make some things simpler, or
> maybe more flexible - in particular, it'd allow us to enable huge pages
> only for some regions (like shared buffers), and keep the small pages
> e.g. for PGPROC. So that'd be good.

The resizing patches split the shared-buffer-related structures into separate memory segments. I think that itself will help enable huge pages for some regions. Would that help in your case?

> But there'd also need to be some logic to "rework" how shared buffers
> get mapped to NUMA nodes after resizing. It'd be silly to start with
> memory on 4 nodes (25% each), resize shared buffers to 50% and end up
> with memory only on 2 of the nodes (because the other 2 nodes were
> originally assigned the upper half of shared buffers).
> I don't have a clear idea how this would be done, but I guess it'd
> require a bit of code invoked sometime after the resize. It'd already
> need to rebuild the freelists in some way, I guess.

Yes, there's code to build the free list. I think we will need code to remap the buffers and buffer descriptors.

--
Best Wishes,
Ashutosh Bapat
On 7/2/25 13:37, Ashutosh Bapat wrote:
> On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
>>
>> Minor optimization. Andres noticed we're tracking the tail of buffer
>> freelist, without using it. So the patch removes that.
>
> The patches for resizing buffers use the lastFreeBuffer to add new
> buffers to the end of free list when expanding it. But we could as
> well add it at the beginning of the free list.
>
> This patch seems almost independent of the rest of the patches. Do you
> need it in the rest of the patches? I understand that those patches
> don't need to worry about maintaining lastFreeBuffer after this patch.
> Is there any other effect?
>
> If we are going to do this, let's do it earlier so that buffer
> resizing patches can be adjusted.

My patches don't particularly rely on this bit; they would work even with lastFreeBuffer. I believe Andres simply noticed the current code does not use lastFreeBuffer, it just maintains it, so he removed that as an optimization. I don't know how significant the improvement is, but if it's measurable we could just do that independently of our patches.

>> There's also the question how this is related to other patches affecting
>> shared memory - I think the most relevant one is the "shared buffers
>> online resize" by Ashutosh, simply because it touches the shared memory.
>
> I have added Dmitry to this thread since he has written most of the
> shared memory handling code.

Thanks.

>> I do think the splitting would actually make some things simpler, or
>> maybe more flexible - in particular, it'd allow us to enable huge pages
>> only for some regions (like shared buffers), and keep the small pages
>> e.g. for PGPROC. So that'd be good.
>
> The resizing patches split the shared buffer related structures into
> separate memory segments. I think that itself will help enabling huge
> pages for some regions.
> Would that help in your case?

Indirectly. My patch can work just fine with a single segment, but being able to enable huge pages only for some of the segments seems better.

>> But there'd also need to be some logic to "rework" how shared buffers
>> get mapped to NUMA nodes after resizing. It'd be silly to start with
>> memory on 4 nodes (25% each), resize shared buffers to 50% and end up
>> with memory only on 2 of the nodes (because the other 2 nodes were
>> originally assigned the upper half of shared buffers).
>>
>> I don't have a clear idea how this would be done, but I guess it'd
>> require a bit of code invoked sometime after the resize. It'd already
>> need to rebuild the freelists in some way, I guess.
>
> Yes, there's code to build the free list. I think we will need code to
> remap the buffers and buffer descriptor.

Right. The good thing is that's just "advisory" information; it doesn't break anything if it's temporarily out of sync. We don't need to "stop" everything to remap the buffers to other nodes, or anything like that. Or at least I think so. It's one thing to "flip" the target mapping (determining which node a buffer should be on), and another to actually migrate the buffers. The first part can be done instantaneously; the second part can happen in the background over a longer time period.

I'm not sure how you're rebuilding the freelist. Presumably it can contain buffers that are no longer valid (after shrinking). How is that handled to not break anything? I think the NUMA variant would do exactly the same thing, except that there are multiple lists.

regards

--
Tomas Vondra
On Wed, Jul 2, 2025 at 6:06 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> I'm not sure how you're rebuilding the freelist. Presumably it can
> contain buffers that are no longer valid (after shrinking). How is that
> handled to not break anything? I think the NUMA variant would do exactly
> the same thing, except that there's multiple lists.

Before shrinking the buffers, we walk the free list, removing any buffers that are going to be removed. When expanding, we link the new buffers in order and then add them to the already existing free list. The 0005 patch in [1] has the code for this.

[1] https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj%40hmuxsf2ngov2

--
Best Wishes,
Ashutosh Bapat
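The free-list walk described above might look roughly like this (an illustrative sketch, not the actual 0005 patch code), assuming the usual freeNext[] chain with FREENEXT_END_OF_LIST (-1) as the end marker: before shrinking, every buffer id at or beyond the new NBuffers must be unlinked from the chain.

```c
#define FREENEXT_END_OF_LIST (-1)

/*
 * Prune a freelist before shrinking shared buffers. The list is a
 * chain of buffer ids linked through freeNext[]; any id that falls
 * outside the new buffer count is dropped. Returns the new head.
 */
static int
prune_freelist(int *freeNext, int first_free, int new_nbuffers)
{
    int  head = FREENEXT_END_OF_LIST;
    int *tail = &head;      /* where the next kept buffer gets linked */
    int  next;

    for (int buf = first_free; buf != FREENEXT_END_OF_LIST; buf = next)
    {
        next = freeNext[buf];           /* save before relinking */
        if (buf < new_nbuffers)
        {
            *tail = buf;                /* keep this buffer in the chain */
            tail = &freeNext[buf];
        }
    }
    *tail = FREENEXT_END_OF_LIST;
    return head;
}
```

A per-node NUMA variant would presumably run the same walk once per partitioned list.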
> On Wed, Jul 02, 2025 at 05:07:28PM +0530, Ashutosh Bapat wrote:
> > There's also the question how this is related to other patches affecting
> > shared memory - I think the most relevant one is the "shared buffers
> > online resize" by Ashutosh, simply because it touches the shared memory.
>
> I have added Dmitry to this thread since he has written most of the
> shared memory handling code.

Thanks! I like the idea behind this patch series. I haven't read it in detail yet, but I can imagine both patches (interleaving and online resizing) could benefit from each other. In online resizing we've introduced the possibility to use multiple shared mappings for different types of data; maybe it would be convenient to use the same interface to create separate mappings for different NUMA nodes as well. Using a separate shared mapping per NUMA node would also make resizing easier, since it would be more straightforward to fit an increased segment into NUMA boundaries.

> > I do think the splitting would actually make some things simpler, or
> > maybe more flexible - in particular, it'd allow us to enable huge pages
> > only for some regions (like shared buffers), and keep the small pages
> > e.g. for PGPROC. So that'd be good.
>
> The resizing patches split the shared buffer related structures into
> separate memory segments. I think that itself will help enabling huge
> pages for some regions. Would that help in your case?

Right, separate segments would allow mixing and matching huge pages with pages of regular size. It's not implemented in the latest version of the online resizing patch, purely to reduce complexity and maintain the same invariant (everything is either using huge pages or not) -- but we could do it the other way around as well.
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi!

> 1) v1-0001-NUMA-interleaving-buffers.patch
[..]
> It's a bit more complicated, because the patch distributes both the
> blocks and descriptors, in the same way. So a buffer and its descriptor
> always end up on the same NUMA node. This is one of the reasons why we need
> to map larger chunks, because NUMA works on page granularity, and the
> descriptors are tiny - many fit on a memory page.

Oh, now I get it! OK, let's stick to this one.

> I do think the splitting would actually make some things simpler, or
> maybe more flexible - in particular, it'd allow us to enable huge pages
> only for some regions (like shared buffers), and keep the small pages
> e.g. for PGPROC. So that'd be good.

You have made the assumption that this is good, but small pages (4KB) are not hugetlb, and are *swappable* (transparent huge pages are swappable too; manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1]. The most frequent problem I see these days is OOMs, and it makes me believe that making certain critical parts of shared memory swappable just to make the page size granular is possibly throwing the baby out with the bathwater. I'm thinking about bad situations like: some wrong settings of vm.swappiness that people keep (or distros keep?), and PG's general inability to restrain from allocating more memory in some cases.

> The other thing I haven't thought about very much is determining on
> which CPUs/nodes the instance is allowed to run. I assume we'd start by
> simply inheriting/determining that at startup through libnuma, not through
> some custom PG configuration (which the patch [2] proposed to do).

0. I think that we could do better; some counterarguments to no-configuration-at-all:

a. As Robert & Bertrand already put it there after review: let's say I want just to run on NUMA node #2, so here I would need to override systemd's script ExecStart= to include that numactl (not elegant?).
I could also use `CPUAffinity=1,3,5,7..`, but that's all, and it is even less friendly. Also it probably requires root to edit/reload systemd, while having a GUC for this, like in my proposal, makes it smoother (I think?)

b. Wouldn't it be better if that stayed as a drop-in rather than always on? What if there's a problem - how do you disable those internal optimizations if they do harm in some cases? (Or let's say I want to play with MPOL_INTERLEAVE_WEIGHTED?) So at least a boolean numa_buffers_interleave would be nice?

c. What if I want my standby (walreceiver+startup/recovery) to run with NUMA affinity to get better performance? (I'm not going to hack around the systemd script every time, but I could imagine changing numa=X,Y,Z after restart/before promotion.)

d. Now if I would be forced for some reason to do that numactl(1) voodoo, and use those above-mentioned overrides, and PG wouldn't be having a GUC (let's say I would use `numactl --weighted-interleave=0,1`), then:

> 2) v1-0002-NUMA-localalloc.patch
> This simply sets "localalloc" when initializing a backend, so that all
> memory allocated later is local, not interleaved. Initially this was
> necessary because the patch set the allocation policy to interleaving
> before initializing shared memory, and we didn't want to interleave the
> private memory. But that's no longer the case - the explicit mapping to
> nodes does not have this issue. I'm keeping the patch for convenience,
> it allows experimenting with numactl etc.

...is not accurate anymore, and we would require having that in (still with a GUC)? Thoughts? I can add that part of mine into your patches if you want.

Way too quick review and some very fast benchmark probes; I've concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt would be too new a topic for me), but let's start:

1.
Normal pgbench -S (still with just s_b@4GB), done many tries, consistent benefit for the patch with like a +8..10% boost on a generic run:

numa_buffers_interleave=off numa_pgproc_interleave=on (due to that always-on "if"), s_b just on 1 NUMA node (might happen)
latency average = 0.373 ms
latency stddev = 0.237 ms
initial connection time = 45.899 ms
tps = 160242.147877 (without initial connection time)

numa_buffers_interleave=on numa_pgproc_interleave=on
latency average = 0.345 ms
latency stddev = 0.373 ms
initial connection time = 44.485 ms
tps = 177564.686094 (without initial connection time)

2. Tested it the same way as I did for mine (problem #2 from Andres's presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with seqconcurrscans.pgb as earlier)

default/numa_buffers_interleave=off
latency average = 1375.478 ms
latency stddev = 1141.423 ms
initial connection time = 46.104 ms
tps = 45.868075 (without initial connection time)

numa_buffers_interleave=on
latency average = 838.128 ms
latency stddev = 498.787 ms
initial connection time = 43.437 ms
tps = 75.413894 (without initial connection time)

and I've repeated the same test (identical conditions) with my patch, which got me slightly more juice:
latency average = 727.717 ms
latency stddev = 410.767 ms
initial connection time = 45.119 ms
tps = 86.844161 (without initial connection time)

(but mine didn't get that boost from normal pgbench as per #1 pgbench -S -- my numa='all' stays @ 160k TPS just as with numa_buffers_interleave=off), so this idea is clearly better.

So should I close https://commitfest.postgresql.org/patch/5703/ and you'll open a new one, or should I just edit #5703, alter it and add this thread too?

3. The patch is not calling interleave on PQ shmem; do we want to add that as some next item like v1-0007? The question is whether OS interleaving makes sense there?
I believe it does there - please see my thread (NUMA_pq_cpu_pinning_results.txt). The issue is that PQ workers are being spawned by the postmaster and may end up on different NUMA nodes randomly, so actually OS-interleaving that memory reduces jitter there (AKA bandwidth-over-latency). My thinking is that one cannot expect a static/forced CPU-to-just-one-NUMA-node assignment for a backend and its PQ workers, because it is impossible to always have CPU power available in that NUMA node, so it might be useful to interleave that shared mem there too (as a separate patch item?)

4. In BufferManagerShmemInit() you call numa_num_configured_nodes() (also in v1-0005). My worry is: should we put some known-limitations docs (?) in from the start and mention that if the VM is greatly resized and new NUMA nodes appear, they might not be used until restart?

5. In v1-0001, pg_numa_interleave_memory():

+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict

Yes, my patch has those numa_error() and numa_warn() handlers too in pg_numa. Feel free to use it for better UX.

+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?

It's not necessary to touch after numa_tonode_memory() (wrapper around numa_interleave_memory()); if it is going to be used anyway, it will be correctly placed, to the best of my knowledge.

6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c

Accidental indents (also fails to apply).

7. We miss the pg_numa_* shims, but for sure that's for later, and also to avoid those Linux-specific #ifdef USE_LIBNUMA and so on?

8. v1-0005, 2x:

+ /* if (numa_procs_interleave) */

Ha! It's a TRAP! I've uncommented it because I wanted to try it out without it (just by setting the GUC off), but "MyProc->sema" is NULL:

2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755) was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because "restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down

[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at ../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at ../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at ../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch (child_type=child_type@entry=B_IO_WORKER, child_slot=20086, startup_data=startup_data@entry=0x0, startup_data_len=startup_data_len@entry=0, client_sock=client_sock@entry=0x0) at ../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess (type=type@entry=B_IO_WORKER) at ../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at ../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0

9. v1-0006: is this just a thought or a serious candidate?
I can imagine it could easily blow up, with some backends somehow requesting CPUs only from one NUMA node while the second node sits idle. Isn't it better just to leave CPU scheduling, well, to the CPU scheduler? The problem is that you have tools showing overall CPU usage, even mpstat(1) per CPU, but no tools for per-NUMA-node CPU util%, so it would be hard for someone to realize that this is happening.

-J.

[1] - https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
On 7/4/25 13:05, Jakub Wartak wrote:
> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi!
>
>> 1) v1-0001-NUMA-interleaving-buffers.patch
> [..]
>> It's a bit more complicated, because the patch distributes both the
>> blocks and descriptors, in the same way. So a buffer and its descriptor
>> always end up on the same NUMA node. This is one of the reasons why we need
>> to map larger chunks, because NUMA works on page granularity, and the
>> descriptors are tiny - many fit on a memory page.
>
> Oh, now I get it! OK, let's stick to this one.
>
>> I do think the splitting would actually make some things simpler, or
>> maybe more flexible - in particular, it'd allow us to enable huge pages
>> only for some regions (like shared buffers), and keep the small pages
>> e.g. for PGPROC. So that'd be good.
>
> You have made the assumption that this is good, but small pages (4KB) are
> not hugetlb, and are *swappable* (transparent huge pages are swappable too;
> manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1]. The
> most frequent problem I see these days is OOMs, and it makes me
> believe that making certain critical parts of shared memory
> swappable just to make the page size granular is possibly throwing the baby
> out with the bathwater. I'm thinking about bad situations like: some
> wrong settings of vm.swappiness that people keep (or distros keep?) and
> general inability of PG to restrain from allocating more memory in
> some cases.

I haven't observed such issues myself, or maybe I didn't realize it's happening. Maybe it happens, but it'd be good to see some data showing that, or a reproducer of some sort. But let's say it's real. I don't think we should use huge pages merely to ensure something is not swapped out. The "not swappable" part is more of a limitation of huge pages than an advantage - you can't just choose to make them swappable. Wouldn't it be better to keep using 4KB pages, but lock the memory using mlock/mlockall?
>> The other thing I haven't thought about very much is determining on
>> which CPUs/nodes the instance is allowed to run. I assume we'd start by
>> simply inheriting/determining that at startup through libnuma, not through
>> some custom PG configuration (which the patch [2] proposed to do).
>
> 0. I think that we could do better, some counterarguments to
> no-configuration-at-all:
>
> a. as Robert & Bertrand already put it there after review: let's say I
> want just to run on NUMA node #2, so here I would need to override
> systemd's script ExecStart= to include that numactl (not elegant?). I
> could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
> less friendly. Also it probably requires root to edit/reload systemd,
> while having a GUC for this like in my proposal makes it smoother (I
> think?)
>
> b. wouldn't it be better if that stayed as a drop-in rather than always
> on? What if there's a problem, how do you disable those internal
> optimizations if they do harm in some cases? (or let's say I want to
> play with MPOL_INTERLEAVE_WEIGHTED?). So at least a boolean
> numa_buffers_interleave would be nice?
>
> c. What if I want my standby (walreceiver+startup/recovery) to run
> with NUMA affinity to get better performance (I'm not going to hack
> around the systemd script every time, but I could imagine changing
> numa=X,Y,Z after restart/before promotion)
>
> d. Now if I would be forced for some reason to do that numactl(1)
> voodoo, and use those above-mentioned overrides, and PG wouldn't be
> having a GUC (let's say I would use `numactl
> --weighted-interleave=0,1`), then:

I'm not against doing something like this, but I don't plan to do that in V1. I don't have a clear idea what configurability is actually needed, so it's likely I'd get the interface wrong.

>> 2) v1-0002-NUMA-localalloc.patch
>> This simply sets "localalloc" when initializing a backend, so that all
>> memory allocated later is local, not interleaved.
Initially this was >> necessary because the patch set the allocation policy to interleaving >> before initializing shared memory, and we didn't want to interleave the >> private memory. But that's no longer the case - the explicit mapping to >> nodes does not have this issue. I'm keeping the patch for convenience, >> it allows experimenting with numactl etc. > > ... is not accurate anymore, so we would require to have that in > (still with a GUC)? > Thoughts? I can add my part into your patches if you want. > I'm sorry, I don't understand what the question is :-( > Way too quick review and some very fast benchmark probes, I've > concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt > would be too new a topic for me), but let's start: > > 1. normal pgbench -S (still with just s_b@4GB), done many tries, > consistent benefit for the patch with like +8..10% boost on a generic > run: > > numa_buffers_interleave=off numa_pgproc_interleave=on (due to that > always-on "if"), s_b just on 1 NUMA node (might happen) > latency average = 0.373 ms > latency stddev = 0.237 ms > initial connection time = 45.899 ms > tps = 160242.147877 (without initial connection time) > > numa_buffers_interleave=on numa_pgproc_interleave=on > latency average = 0.345 ms > latency stddev = 0.373 ms > initial connection time = 44.485 ms > tps = 177564.686094 (without initial connection time) > > 2. 
Tested it the same way as I did for mine (problem #2 from Andres's > presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with > seqconcurrscans.pgb as earlier) > default/numa_buffers_interleave=off > latency average = 1375.478 ms > latency stddev = 1141.423 ms > initial connection time = 46.104 ms > tps = 45.868075 (without initial connection time) > > numa_buffers_interleave=on > latency average = 838.128 ms > latency stddev = 498.787 ms > initial connection time = 43.437 ms > tps = 75.413894 (without initial connection time) > > and I've repeated the same test (identical conditions) with my > patch, which got me slightly more juice: > latency average = 727.717 ms > latency stddev = 410.767 ms > initial connection time = 45.119 ms > tps = 86.844161 (without initial connection time) > > (but mine didn't get that boost from normal pgbench as per #1 > pgbench -S -- my numa='all' stays @ 160k TPS just as > numa_buffers_interleave=off), so this idea is clearly better. Good, thanks for the testing. I should have done something like this when I posted my patches, but I forgot about that (and the email felt too long anyway). But this actually brings up an interesting question. What exactly should we expect / demand from these patches? In my mind it'd be primarily about predictability and stability of results. For example, the results should not depend on how the database was warmed up - was it done by a single backend or many backends? Was it restarted, or what? I could probably warm up the system very carefully to ensure it's balanced. The patches mean I don't need to be that careful. > So should I close https://commitfest.postgresql.org/patch/5703/ > and you'll open a new one, or should I just edit #5703 and alter it > and add this thread too? > Good question. It's probably best to close the original entry as "withdrawn" and I'll add a new entry. Sounds OK? > 3. Patch is not calling interleave on PQ shmem, do we want to add that > in as some next item like v1-0007? 
The question is whether OS interleaving > makes sense there? I believe it does there, please see my thread > (NUMA_pq_cpu_pinning_results.txt); the issue is that PQ workers are > being spawned by postmaster and may end up on different NUMA nodes > randomly, so actually OS-interleaving that memory reduces jitter there > (AKA bandwidth-over-latency). My thinking is that one cannot expect > static/forced CPU-to-just-one-NUMA-node assignment for a backend and > its PQ workers, because it is impossible to always have available CPU > power in that NUMA node, so it might be useful to interleave > that shared mem there too (as a separate patch item?) > Excellent question. I haven't thought about this at all. I agree it probably makes sense to interleave this memory, in some way. I don't know what the perfect scheme is, though. Wild idea: Would it make sense to pin the workers to the same NUMA node as the leader? And allocate all memory only from that node? > 4. In BufferManagerShmemInit() you call numa_num_configured_nodes() > (also in v1-0005). My worry is whether we should put some > known-limitations docs (?) in from the start and mention that > if the VM is greatly resized and new NUMA nodes appear, they might > not be used until restart? > Yes, this is one thing I need some feedback on. The patches mostly assume there are no disabled nodes, that the set of allowed nodes does not change, etc. I think for V1 that's a reasonable limitation. But let's say we want to relax this a bit. How do we learn about the change, after a node/CPU gets disabled? For some parts it's not that difficult (e.g. we can "remap" buffers/descriptors in the background). But for other parts that's not practical. E.g. we can't rework how the PGPROC gets split. But while discussing this with Andres yesterday, he had an interesting suggestion - to always use e.g. 8 or 16 partitions, then partition this by NUMA node. 
So we'd have 16 partitions, and with 4 nodes the 0-3 would go to node 0, 4-7 would go to node 1, etc. The advantage is that if a node gets disabled, we can rebuild just this small "mapping" and not the 16 partitions. And the partitioning may be helpful even without NUMA. Still have to figure out the details, but it seems it might help. > 5. In v1-0001, pg_numa_interleave_memory() > > + * XXX no return value, to make this fail on error, has to use > + * numa_set_strict > > Yes, my patch has those numa_error() and numa_warn() handlers too in > pg_numa. Feel free to use them for better UX. > > + * XXX Should we still touch the memory first, like > with numa_move_pages, > + * or is that not necessary? > > It's not necessary to touch after numa_tonode_memory() (wrapper around > numa_interleave_memory()); if it is going to be used anyway, it will be > correctly placed, to the best of my knowledge. > > 6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c > > Accidental indents (also fails to apply) > > 7. We miss the pg_numa_* shims, but for sure that's for later, and also > avoid those Linux-specific #ifdef USE_LIBNUMA and so on? > Right, we need to add those. Or actually, we need to think about how we'd do this for non-NUMA systems. I wonder if we even want to just build everything the "old way" (without the partitions, etc.). But per the earlier comment, the partitioning seems beneficial even on non-NUMA systems, so maybe the shims are good enough. > 8. v1-0005 2x + /* if (numa_procs_interleave) */ > > Ha! It's a TRAP! I've uncommented it because I wanted to try it out > without it (just by setting the GUC off), but "MyProc->sema" is NULL: > > 2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL > 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit > [..] 
> 2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755) > was terminated by signal 11: Segmentation fault > 2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other > active server processes > 2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because > "restart_after_crash" is off > 2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down > > [New LWP 28755] > [Thread debugging using libthread_db enabled] > Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". > Core was generated by `postgres: io worker '. > Program terminated with signal SIGSEGV, Segmentation fault. > #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) > at ./nptl/sem_waitcommon.c:136 > 136 ./nptl/sem_waitcommon.c: No such file or directory. > (gdb) where > #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) > at ./nptl/sem_waitcommon.c:136 > #1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81 > #2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at > ../src/backend/port/posix_sema.c:302 > #3 0x0000556191970553 in InitAuxiliaryProcess () at > ../src/backend/storage/lmgr/proc.c:992 > #4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at > ../src/backend/postmaster/auxprocess.c:65 > #5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized > out>, startup_data_len=<optimized out>) at > ../src/backend/storage/aio/method_worker.c:393 > #6 0x00005561918e8163 in postmaster_child_launch > (child_type=child_type@entry=B_IO_WORKER, child_slot=20086, > startup_data=startup_data@entry=0x0, > startup_data_len=startup_data_len@entry=0, > client_sock=client_sock@entry=0x0) at > ../src/backend/postmaster/launch_backend.c:290 > #7 0x00005561918ea09a in StartChildProcess > (type=type@entry=B_IO_WORKER) at > ../src/backend/postmaster/postmaster.c:3973 > #8 0x00005561918ea308 in maybe_adjust_io_workers () at > ../src/backend/postmaster/postmaster.c:4404 > [..] 
(gdb) print *MyProc->sem > Cannot access memory at address 0x0 > Yeah, good catch. I'll look into that next week. > 9. v1-0006: is this just a thought or a serious candidate? I can imagine > it could easily blow up with some backends somehow requesting CPUs only > from one NUMA node, while the second node stays idle. Isn't it better > just to leave CPU scheduling, well, to the CPU scheduler? The problem > is that you have tools showing overall CPU usage, even mpstat(1) per > CPU, but no tools for per-NUMA-node CPU util%, so it would be hard > for someone to realize that this is happening. > Mostly experimental, for benchmarking etc. I agree we may not want to mess with the task scheduling too much. Thanks for the feedback! regards -- Tomas Vondra
Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach
From
Cédric Villemain
Date:
Hi Tomas, I haven't yet had time to fully read all the work and proposals around NUMA and related features, but I hope to catch up over the summer. However, I think it's important to share some thoughts before it's too late, as you might find them relevant to the NUMA management code. > 6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch > > This is an experimental patch, that simply pins the new process to the > NUMA node obtained from the freelist. > > Driven by GUC "numa_procs_pin" (default: off). In my work on more careful PostgreSQL resource management, I've come to the conclusion that we should avoid pushing policy too deeply into the PostgreSQL core itself. Therefore, I'm quite skeptical about integrating NUMA-specific management directly into core PostgreSQL in such a way. We are working on a PROFILE and PROFILE MANAGER specification to provide PostgreSQL with only the APIs and hooks needed so that extensions can manage whatever they want externally. The basic syntax (not meant to be discussed here, and even the names might change) is roughly as follows, just to illustrate the intent: CREATE PROFILE MANAGER manager_name [IF NOT EXISTS] [ HANDLER handler_function | NO HANDLER ] [ VALIDATOR validator_function | NO VALIDATOR ] [ OPTIONS ( option 'value' [, ... ] ) ] CREATE PROFILE profile_name [IF NOT EXISTS] USING profile_manager SET key = value [, key = value]... [USING profile_manager SET key = value [, key = value]...] [...]; CREATE PROFILE MAPPING [IF NOT EXISTS] FOR PROFILE profile_name [MATCH [ ALL | ANY ] ( [ROLE role_name], [BACKEND TYPE backend_type], [DATABASE database_name], [APPLICATION appname] )]; ## PROFILE RESOLUTION ORDER 1. ALTER ROLE IN DATABASE 2. ALTER ROLE 3. ALTER DATABASE 4. First matching PROFILE MAPPING (global or specific) 5. 
No profile (fallback) As currently designed, this approach allows quite a lot of flexibility: * pg_psi is used to ensure the spec is suitable for a cgroup profile manager (moving PIDs as needed; NUMA and cgroups could work well together, see e.g. this Linux kernel summary: https://blogs.oracle.com/linux/post/numa-balancing ) * Someone else could implement support for Windows or BSD specifics. * Others might use it to integrate PostgreSQL's own resources (e.g., "areas" of shared buffers) into policies. Hope this perspective is helpful. Best regards, -- Cédric Villemain +33 6 20 30 22 52 https://www.Data-Bene.io PostgreSQL Support, Expertise, Training, R&D
On 7/5/25 09:09, Cédric Villemain wrote: > Hi Tomas, > > > I haven't yet had time to fully read all the work and proposals around > NUMA and related features, but I hope to catch up over the summer. > > However, I think it's important to share some thoughts before it's too > late, as you might find them relevant to the NUMA management code. > > >> 6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch >> >> This is an experimental patch, that simply pins the new process to the >> NUMA node obtained from the freelist. >> >> Driven by GUC "numa_procs_pin" (default: off). > > > In my work on more careful PostgreSQL resource management, I've come to > the conclusion that we should avoid pushing policy too deeply into the > PostgreSQL core itself. Therefore, I'm quite skeptical about integrating > NUMA-specific management directly into core PostgreSQL in such a way. > > > We are working on a PROFILE and PROFILE MANAGER specification to provide > PostgreSQL with only the APIs and hooks needed so that extensions can > manage whatever they want externally. > > The basic syntax (not meant to be discussed here, and even the names > might change) is roughly as follows, just to illustrate the intent: > > > CREATE PROFILE MANAGER manager_name [IF NOT EXISTS] > [ HANDLER handler_function | NO HANDLER ] > [ VALIDATOR validator_function | NO VALIDATOR ] > [ OPTIONS ( option 'value' [, ... ] ) ] > > CREATE PROFILE profile_name > [IF NOT EXISTS] > USING profile_manager > SET key = value [, key = value]... > [USING profile_manager > SET key = value [, key = value]...] > [...]; > > CREATE PROFILE MAPPING > [IF NOT EXISTS] > FOR PROFILE profile_name > [MATCH [ ALL | ANY ] ( > [ROLE role_name], > [BACKEND TYPE backend_type], > [DATABASE database_name], > [APPLICATION appname] > )]; > > ## PROFILE RESOLUTION ORDER > > 1. ALTER ROLE IN DATABASE > 2. ALTER ROLE > 3. ALTER DATABASE > 4. First matching PROFILE MAPPING (global or specific) > 5. 
No profile (fallback) > > As currently designed, this approach allows quite a lot of flexibility: > > * pg_psi is used to ensure the spec is suitable for a cgroup profile > manager (moving PIDs as needed; NUMA and cgroups could work well > together, see e.g. this Linux kernel summary: https://blogs.oracle.com/ > linux/post/numa-balancing ) > > * Someone else could implement support for Windows or BSD specifics. > > * Others might use it to integrate PostgreSQL's own resources (e.g., > "areas" of shared buffers) into policies. > > Hope this perspective is helpful. Can you explain how you want to manage this by an extension defined at the SQL level, when most of this stuff has to be done when setting up shared memory, which is waaaay before we have any access to catalogs? regards -- Tomas Vondra
Hi Tomas, some more thoughts after the weekend: On Fri, Jul 4, 2025 at 8:12 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 7/4/25 13:05, Jakub Wartak wrote: > > On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote: > > > > Hi! > > > >> 1) v1-0001-NUMA-interleaving-buffers.patch > > [..] > >> It's a bit more complicated, because the patch distributes both the > >> blocks and descriptors, in the same way. So a buffer and it's descriptor > >> always end on the same NUMA node. This is one of the reasons why we need > >> to map larger chunks, because NUMA works on page granularity, and the > >> descriptors are tiny - many fit on a memory page. > > > > Oh, now I get it! OK, let's stick to this one. > > > >> I don't think the splitting would actually make some things simpler, or > >> maybe more flexible - in particular, it'd allow us to enable huge pages > >> only for some regions (like shared buffers), and keep the small pages > >> e.g. for PGPROC. So that'd be good. > > > > You have made assumption that this is good, but small pages (4KB) are > > not hugetlb, and are *swappable* (Transparent HP are swappable too, > > manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The > > most frequent problem I see these days are OOMs, and it makes me > > believe that making certain critical parts of shared memory being > > swappable just to make pagesize granular is possibly throwing the baby > > out with the bathwater. I'm thinking about bad situations like: some > > wrong settings of vm.swapiness that people keep (or distros keep?) and > > general inability of PG to restrain from allocating more memory in > > some cases. > > > > I haven't observed such issues myself, or maybe I didn't realize it's > happening. Maybe it happens, but it'd be good to see some data showing > that, or a reproducer of some sort. But let's say it's real. > > I don't think we should use huge pages merely to ensure something is not > swapped out. 
The "not swappable" is more of a limitation of huge pages, > not an advantage. You can't just choose to make them swappable. > > Wouldn't it be better to keep using 4KB pages, but lock the memory using > mlock/mlockall? In my book, not being swappable is a win (it's hard for me to imagine when it could be beneficial to swap out parts of s_b). I was trying to think about it, and here is what I came up with: Anyway, mlock() probably sounds like it, but e.g. Rocky 8.10 by default has max locked memory (ulimit -l) as low as 64kB due to systemd's DefaultLimitMEMLOCK, while Debian/Ubuntu have it at higher values. Wasn't expecting that - those are bizarrely low values. I think we would need something like (10000*900)/1024/1024 or more, but with each PGPROC on a separate page that would be even way more? Another thing with 4kB pages: there's a big assumption made now that once we arrive in InitProcess() we won't ever change NUMA node, so we stick to the PGPROC from where we started (based on getcpu(2)). Let's assume the CPU scheduler reassigned us to a different node; with this hypothetical 4kB-page PGPROC patch, we would need to rely on NUMA autobalancing doing its job to migrate that 4kB page from node to node (to get local accesses instead of remote ones). The questions in my head now are: - we initially asked for those PGPROC pages to be localized on a certain node (they have a policy), so they won't autobalance? We would need to call getcpu() again somewhere, notice the difference, and unlocalize (clear the NUMA/mbind() policy) the PGPROC page? - mlock() as above says stick to the physical RAM page(?), so it won't move? - after what time would the kernel's autobalancing migrate that page after switching the active CPU<->node? I mean, do we execute enough reads on this page? BTW: to move this into the pragmatic realm, what's the most one-liner/trivial way to exercise/stress PGPROC? 
> >> The other thing I haven't thought about very much is determining on > >> which CPUs/nodes the instance is allowed to run. I assume we'd start by > >> simply inherit/determine that at the start through libnuma, not through > >> some custom PG configuration (which the patch [2] proposed to do). > > > > 0. I think that we could do better, some counter arguments to > > no-configuration-at-all: > > > > a. as Robert & Bertrand already put it there after review: let's say I > > want just to run on NUMA #2 node, so here I would need to override > > systemd's script ExecStart= to include that numactl (not elegant?). I > > could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even > > less friendly. Also it probably requires root to edit/reload systemd, > > while having GUC for this like in my proposal makes it more smooth (I > > think?) > > > > b. wouldn't it be better if that stayed as drop-in rather than always > > on? What if there's a problem, how do you disable those internal > > optimizations if they do harm in some cases? (or let's say I want to > > play with MPOL_INTERLEAVE_WEIGHTED?). So at least boolean > > numa_buffers_interleave would be nice? > > > > c. What if I want my standby (walreceiver+startup/recovery) to run > > with NUMA affinity to get better performance (I'm not going to hack > > around systemd script every time, but I could imagine changing > > numa=X,Y,Z after restart/before promotion) > > > > d. Now if I would be forced for some reason to do that numactl(1) > > voodoo, and use the those above mentioned overrides and PG wouldn't be > > having GUC (let's say I would use `numactl > > --weighted-interleave=0,1`), then: > > > > I'm not against doing something like this, but I don't plan to do that > in V1. I don't have a clear idea what configurability is actually > needed, so it's likely I'd do the interface wrong. 
> >> 2) v1-0002-NUMA-localalloc.patch > >> This simply sets "localalloc" when initializing a backend, so that all > >> memory allocated later is local, not interleaved. Initially this was > >> necessary because the patch set the allocation policy to interleaving > >> before initializing shared memory, and we didn't want to interleave the > >> private memory. But that's no longer the case - the explicit mapping to > >> nodes does not have this issue. I'm keeping the patch for convenience, > >> it allows experimenting with numactl etc. > > > > .. .is not accurate anymore and we would require to have that in > > (still with GUC) ? > > Thoughts? I can add that mine part into Your's patches if you want. > > > > I'm sorry, I don't understand what's the question :-( That patch reference above - it was a chain of thought from step "d". What I had in mind was that you cannot remove the patch `v1-0002-NUMA-localalloc.patch` from the scope if you force people to use numactl by not having enough configurability on the PG side. That is: if someone has to use systemd+numactl --interleave/--weighted-interleave, then he will also need a way to use numa_localalloc=on (to override the new/user's policy default; otherwise local memory allocations are also going to be interleaved, and we are back to square one). Which brings me to the point of why, instead of this toggle, we should include the configuration properly from the start (it's not that hard, apparently). > > Way too quick review and some very fast benchmark probes, I've > > concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt > > would be too new topic for me), but let's start: > > > > 1. normal pgbench -S (still with just s_b@4GB), done many tries, > > consistent benefit for the patch with like +8..10% boost on generic > > run: > > [.. removed numbers] > > But this actually brings an interesting question. What exactly should we > expect / demand from these patches? 
In my mind it'd primarily about > predictability and stability of results. > > For example, the results should not depend on how was the database > warmed up - was it done by a single backend or many backends? Was it > restarted, or what? I could probably warmup the system very carefully to > ensure it's balanced. The patches mean I don't need to be that careful. Well, pretty much the same here. I was after minimizing "stddev" (to have better predictability of results, especially across restarts) and increasing available bandwidth [which is pretty much related]. Without our NUMA work, PG can just put that s_b on any random node, or spill randomly from one node to another (depending on the size of the allocation request). > > So should I close https://commitfest.postgresql.org/patch/5703/ > > and you'll open a new one or should I just edit the #5703 and alter it > > and add this thread too? > > > > Good question. It's probably best to close the original entry as > "withdrawn" and I'll add a new entry. Sounds OK? Sure thing, marked it as `Returned with feedback`; this approach seems to be much more advanced. > > 3. Patch is not calling interleave on PQ shmem, do we want to add that > > in as some next item like v1-0007? Question is whether OS interleaving > > makes sense there ? I believe it does there, please see my thread > > (NUMA_pq_cpu_pinning_results.txt), the issue is that PQ workers are > > being spawned by postmaster and may end up on different NUMA nodes > > randomly, so actually OS-interleaving that memory reduces jitter there > > (AKA bandwidth-over-latency). My thinking is that one cannot expect > > static/forced CPU-to-just-one-NUMA-node assignment for backend and > > it's PQ workers, because it is impossible have always available CPU > > power there in that NUMA node, so it might be useful to interleave > > that shared mem there too (as separate patch item?) > > > > Excellent question. I haven't thought about this at all. 
I agree it > probably makes sense to interleave this memory, in some way. I don't > know what's the perfect scheme, though. > > wild idea: Would it make sense to pin the workers to the same NUMA node > as the leader? And allocate all memory only from that node? I'm trying to convey exactly the opposite message, or at least that it might depend on configuration. Please see https://www.postgresql.org/message-id/CAKZiRmxYMPbQ4WiyJWh%3DVuw_Ny%2BhLGH9_9FaacKRJvzZ-smm%2Bw%40mail.gmail.com (btw, it should read there that I don't intend to spend a lot of time on PQ), but anyway: I think we should NOT pin the PQ workers to the same node, as you do not know if there's CPU capacity left there (same story as with v1-0006 here). I'm just proposing quick OS-based interleaving of PQ shm if using all nodes, literally:

@@ -334,6 +336,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
     }
     *mapped_address = address;
     *mapped_size = request_size;
+
+    /* We interleave memory only at creation time. */
+    if (op == DSM_OP_CREATE && numa->setting > NUMA_OFF) {
+        elog(DEBUG1, "interleaving shm mem @ %p size=%zu", *mapped_address, *mapped_size);
+        pg_numa_interleave_memptr(*mapped_address, *mapped_size, numa->nodes);
+    }
+

Because then, if memory is interleaved, you probably have less variance in memory access. But also from that previous thread: "So if anything: - latency-wise: it would be best to place leader+all PQ workers close to s_b, provided s_b fits NUMA shared/huge page memory there and you won't need more CPU than there's on that NUMA node... (assuming e.g. 
hosting 4 DBs on 4 sockets, each on its own, it would be best to pin everything including shm, but PQ workers too) - capacity/TPS-wise or s_b > NUMA: just interleave to maximize bandwidth and get uniform CPU performance out of this" So the wild idea was: maybe PQ shm interleaving should depend on the NUMA configuration (if interleaving to all nodes, then interleave normally; but if the configuration sets just 1 NUMA node, it automatically binds there -- there was '@' support for that in my patch). > > 4 In BufferManagerShmemInit() you call numa_num_configured_nodes() > > (also in v1-0005). My worry is should we may put some > > known-limitations docs (?) from start and mention that > > if the VM is greatly resized and NUMA numa nodes appear, they might > > not be used until restart? > > > > Yes, this is one thing I need some feedback on. The patches mostly > assume there are no disabled nodes, that the set of allowed nodes does > not change, etc. I think for V1 that's a reasonable limitation. Sure! > But let's say we want to relax this a bit. How do we learn about the > change, after a node/CPU gets disabled? For some parts it's not that > difficult (e.g. we can "remap" buffers/descriptors) in the background. > But for other parts that's not practical. E.g. we can't rework how the > PGPROC gets split. > > But while discussing this with Andres yesterday, he had an interesting > suggestion - to always use e.g. 8 or 16 partitions, then partition this > by NUMA node. So we'd have 16 partitions, and with 4 nodes the 0-3 would > go to node 0, 4-7 would go to node 1, etc. The advantage is that if a > node gets disabled, we can rebuild just this small "mapping" and not the > 16 partitions. And the partitioning may be helpful even without NUMA. > > Still have to figure out the details, but seems it might help. 
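(As an aside, the fixed-partition mapping described above seems simple enough to sketch. This is purely illustrative - the names and the divisibility assumption are mine, not from any posted patch: with 16 partitions and 4 nodes, partitions 0-3 map to node 0, 4-7 to node 1, and so on.)

```c
/*
 * Illustrative sketch of the fixed-partition scheme: always create
 * NUM_PARTITIONS partitions and map contiguous ranges of them to NUMA
 * nodes. If a node disappears, only this small mapping needs to be
 * rebuilt, not the partitions themselves.
 */
#define NUM_PARTITIONS 16

static int
partition_to_node(int partition, int num_nodes)
{
    /* assumes NUM_PARTITIONS is a multiple of num_nodes */
    int partitions_per_node = NUM_PARTITIONS / num_nodes;

    return partition / partitions_per_node;
}
```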
Right, no idea how the shared_memory remapping patch will work (how/when the s_b change will be executed), but we could somehow mark that the number of NUMA nodes should be rechecked during SIGHUP (?) and then just do a simple comparison checking whether old_numa_num_configured_nodes == new_numa_num_configured_nodes still holds. Anyway, I think it's way too advanced for now, don't you think? (like CPU ballooning [s_b itself] is rare, and NUMA ballooning seems to be super-wild-rare). As for the rest, I forgot to include this too: getcpu() - this really needs a portable pg_getcpu() wrapper. -J.
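For what it's worth, such a wrapper might look roughly like this (a hedged sketch only: pg_getcpu is a made-up name, the non-Linux branch just reports "unknown", and on Linux this uses the raw getcpu(2) syscall, for which glibc also offers sched_getcpu(3)):

```c
#include <unistd.h>
#ifdef __linux__
#include <sys/syscall.h>
#endif

/*
 * Hypothetical portable wrapper (illustrative, not from the patches):
 * return the CPU the calling thread is currently running on, or -1 if
 * the platform cannot tell us.
 */
static int
pg_getcpu(void)
{
#ifdef SYS_getcpu
    unsigned int cpu;

    /* getcpu(2): second argument would return the NUMA node */
    if (syscall(SYS_getcpu, &cpu, NULL, NULL) == 0)
        return (int) cpu;
    return -1;
#else
    return -1;
#endif
}
```

The result is only a hint, of course - the scheduler may move the thread to another CPU immediately after the call returns.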
Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach
From
Cédric Villemain
Date:
>> * Others might use it to integrate PostgreSQL's own resources (e.g., >> "areas" of shared buffers) into policies. >> >> Hope this perspective is helpful. > > Can you explain how you want to manage this by an extension defined at > the SQL level, when most of this stuff has to be done when setting up > shared memory, which is waaaay before we have any access to catalogs? I should have said module instead. I didn't follow carefully, but at some point there was discussion about shared buffers being resized "on-line". Anyway, it was just to give a few examples; maybe this one is to be considered later (I'm focused on cgroup/psi, and precisely on reassigning PIDs as needed). -- Cédric Villemain +33 6 20 30 22 52 https://www.Data-Bene.io PostgreSQL Support, Expertise, Training, R&D
On 7/7/25 16:51, Cédric Villemain wrote: >>> * Others might use it to integrate PostgreSQL's own resources (e.g., >>> "areas" of shared buffers) into policies. >>> >>> Hope this perspective is helpful. >> >> Can you explain how you want to manage this by an extension defined at >> the SQL level, when most of this stuff has to be done when setting up >> shared memory, which is waaaay before we have any access to catalogs? > > I should have said module instead, I didn't follow carefully but at some > point there were discussion about shared buffers resized "on-line". > Anyway, it was just to give some few examples, maybe this one is to be > considered later (I'm focused on cgroup/psi, and precisely reassigning > PIDs as needed). > I don't know. I have a hard time imagining what exactly the policies / profiles would do to respond to changes in system utilization. And why should that interfere with this patch ... The main thing the patch series aims to implement is partitioning different pieces of shared memory (buffers, freelists, ...) to better work for NUMA. I don't think there are that many ways to do this, and I doubt it makes sense to make this easily customizable from external modules of any kind. I can imagine providing some API allowing the instance to be isolated on selected NUMA nodes, but that's about it. Yes, there's some relation to the online resizing of shared buffers, in which case we need to "refresh" some of the information. But AFAICS it's not very extensive (on top of what already needs to happen after the resize), and it'd happen within the boundaries of the partitioning scheme. There's not that much flexibility. The last bit (pinning backends to a NUMA node) is experimental, and mostly intended for easier evaluation of the earlier parts (e.g. to limit the noise when processes get moved to a CPU from a different NUMA node, and so on). regards -- Tomas Vondra
Hi, On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote: > In my work on more careful PostgreSQL resource management, I've come to the > conclusion that we should avoid pushing policy too deeply into the > PostgreSQL core itself. Therefore, I'm quite skeptical about integrating > NUMA-specific management directly into core PostgreSQL in such a way. I think it's actually the opposite - whenever we pushed stuff like this outside of core it has hurt postgres substantially. Not having replication in core was a huge mistake. Not having HA management in core is probably the biggest current adoption hurdle for postgres. To deal better with NUMA we need to improve memory placement and various algorithms, in an interrelated way - that's pretty much impossible to do outside of core. Greetings, Andres Freund
Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach
From: Cédric Villemain
> On 7/7/25 16:51, Cédric Villemain wrote: >>>> * Others might use it to integrate PostgreSQL's own resources (e.g., >>>> "areas" of shared buffers) into policies. >>>> >>>> Hope this perspective is helpful. >>> >>> Can you explain how you want to manage this by an extension defined at >>> the SQL level, when most of this stuff has to be done when setting up >>> shared memory, which is waaaay before we have any access to catalogs? >> >> I should have said module instead, I didn't follow carefully but at some >> point there were discussion about shared buffers resized "on-line". >> Anyway, it was just to give some few examples, maybe this one is to be >> considered later (I'm focused on cgroup/psi, and precisely reassigning >> PIDs as needed). >> > > I don't know. I have a hard time imagining what exactly would the > policies / profiles do exactly to respond to changes in the system > utilization. And why should that interfere with this patch ... > > The main thing patch series aims to implement is partitioning different > pieces of shared memory (buffers, freelists, ...) to better work for > NUMA. I don't think there's that many ways to do this, and I doubt it > makes sense to make this easily customizable from external modules of > any kind. I can imagine providing some API allowing to isolate the > instance on selected NUMA nodes, but that's about it. > > Yes, there's some relation to the online resizing of shared buffers, in > which case we need to "refresh" some of the information. But AFAICS it's > not very extensive (on top of what already needs to happen after the > resize), and it'd happen within the boundaries of the partitioning > scheme. There's not that much flexibility. > > The last bit (pinning backends to a NUMA node) is experimental, and > mostly intended for easier evaluation of the earlier parts (e.g. to > limit the noise when processes get moved to a CPU from a different NUMA > node, and so on). 
The backend pinning can be done by replacing your patch on proc.c with a call to an external profile manager doing exactly the same thing, maybe? Similar to:

pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
    pmroutine->init_process != NULL)
    pmroutine->init_process(MyProc);

...

pmroutine = GetPmRoutineForInitAuxilliary();
if (pmroutine != NULL &&
    pmroutine->init_auxilliary != NULL)
    pmroutine->init_auxilliary(MyProc);

Added in a few rare places, this should cover most if not all of the requirements around process placement (process_shared_preload_libraries() is called earlier in process creation, I believe).

-- Cédric Villemain +33 6 20 30 22 52 https://www.Data-Bene.io PostgreSQL Support, Expertise, Training, R&D
Hi Andres, > Hi, > > On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote: >> In my work on more careful PostgreSQL resource management, I've come to the >> conclusion that we should avoid pushing policy too deeply into the >> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating >> NUMA-specific management directly into core PostgreSQL in such a way. > > I think it's actually the opposite - whenever we pushed stuff like this > outside of core it has hurt postgres substantially. Not having replication in > core was a huge mistake. Not having HA management in core is probably the > biggest current adoption hurdle for postgres. > > To deal better with NUMA we need to improve memory placement and various > algorithms, in an interrelated way - that's pretty much impossible to do > outside of core. Except for the backend pinning, which is easy to achieve; hence my comment on the related patch. I'm not claiming NUMA memory and the rest should be managed outside of core (though I haven't read the other patches yet). -- Cédric Villemain +33 6 20 30 22 52 https://www.Data-Bene.io PostgreSQL Support, Expertise, Training, R&D
>> On 7/7/25 16:51, Cédric Villemain wrote: >>>>> * Others might use it to integrate PostgreSQL's own resources (e.g., >>>>> "areas" of shared buffers) into policies. >>>>> >>>>> Hope this perspective is helpful. >>>> >>>> Can you explain how you want to manage this by an extension defined at >>>> the SQL level, when most of this stuff has to be done when setting up >>>> shared memory, which is waaaay before we have any access to catalogs? >>> >>> I should have said module instead, I didn't follow carefully but at some >>> point there were discussion about shared buffers resized "on-line". >>> Anyway, it was just to give some few examples, maybe this one is to be >>> considered later (I'm focused on cgroup/psi, and precisely reassigning >>> PIDs as needed). >>> >> >> I don't know. I have a hard time imagining what exactly would the >> policies / profiles do exactly to respond to changes in the system >> utilization. And why should that interfere with this patch ... >> >> The main thing patch series aims to implement is partitioning different >> pieces of shared memory (buffers, freelists, ...) to better work for >> NUMA. I don't think there's that many ways to do this, and I doubt it >> makes sense to make this easily customizable from external modules of >> any kind. I can imagine providing some API allowing to isolate the >> instance on selected NUMA nodes, but that's about it. >> >> Yes, there's some relation to the online resizing of shared buffers, in >> which case we need to "refresh" some of the information. But AFAICS it's >> not very extensive (on top of what already needs to happen after the >> resize), and it'd happen within the boundaries of the partitioning >> scheme. There's not that much flexibility. >> >> The last bit (pinning backends to a NUMA node) is experimental, and >> mostly intended for easier evaluation of the earlier parts (e.g. to >> limit the noise when processes get moved to a CPU from a different NUMA >> node, and so on). 
> > The backend pinning can be done by replacing your patch on proc.c to > call an external profile manager doing exactly the same thing maybe ? > > Similar to: > pmroutine = GetPmRoutineForInitProcess(); > if (pmroutine != NULL && > pmroutine->init_process != NULL) > pmroutine->init_process(MyProc); > > ... > > pmroutine = GetPmRoutineForInitAuxilliary(); > if (pmroutine != NULL && > pmroutine->init_auxilliary != NULL) > pmroutine->init_auxilliary(MyProc); > > Added on some rare places should cover most if not all the requirement > around process placement (process_shared_preload_libraries() is called > earlier in the process creation I believe). > After a first read I think this works for patches 002 and 005. For this last one, InitProcGlobal() may set things up as you do, but then expose the choice a bit later, basically in the places where you added the if condition on the GUC numa_procs_interleave. -- Cédric Villemain +33 6 20 30 22 52 https://www.Data-Bene.io PostgreSQL Support, Expertise, Training, R&D
Hi, On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote: > On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote: > > I don't think the splitting would actually make some things simpler, or > > maybe more flexible - in particular, it'd allow us to enable huge pages > > only for some regions (like shared buffers), and keep the small pages > > e.g. for PGPROC. So that'd be good. > > You have made assumption that this is good, but small pages (4KB) are > not hugetlb, and are *swappable* (Transparent HP are swappable too, > manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The > most frequent problem I see these days are OOMs, and it makes me > believe that making certain critical parts of shared memory being > swappable just to make pagesize granular is possibly throwing the baby > out with the bathwater. I'm thinking about bad situations like: some > wrong settings of vm.swapiness that people keep (or distros keep?) and > general inability of PG to restrain from allocating more memory in > some cases. The reason it would be advantageous to put something like the procarray onto smaller pages is that otherwise the entire procarray (unless particularly large) ends up on a single NUMA node, increasing the latency for backends on every other numa node and increasing memory traffic on that node. Greetings, Andres Freund
On 7/8/25 05:04, Andres Freund wrote: > Hi, > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote: >> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote: >>> I don't think the splitting would actually make some things simpler, or >>> maybe more flexible - in particular, it'd allow us to enable huge pages >>> only for some regions (like shared buffers), and keep the small pages >>> e.g. for PGPROC. So that'd be good. >> >> You have made assumption that this is good, but small pages (4KB) are >> not hugetlb, and are *swappable* (Transparent HP are swappable too, >> manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The >> most frequent problem I see these days are OOMs, and it makes me >> believe that making certain critical parts of shared memory being >> swappable just to make pagesize granular is possibly throwing the baby >> out with the bathwater. I'm thinking about bad situations like: some >> wrong settings of vm.swapiness that people keep (or distros keep?) and >> general inability of PG to restrain from allocating more memory in >> some cases. > > The reason it would be advantageous to put something like the procarray onto > smaller pages is that otherwise the entire procarray (unless particularly > large) ends up on a single NUMA node, increasing the latency for backends on > every other numa node and increasing memory traffic on that node. > That's why the patch series splits the procarray into multiple pieces, so that it can be properly distributed on multiple NUMA nodes even with huge pages. It requires adjusting a couple of places accessing the entries, but it surprised me how limited the impact was. If we could selectively use 4KB pages for parts of the shared memory, maybe this wouldn't be necessary. But it's not too annoying. The thing I'm not sure about is how much this actually helps with the traffic between nodes. Sure, if we pick a PGPROC from the same node, and the task does not get moved, it'll be local traffic. 
But if the task moves, there'll be traffic. I don't have any estimate of how often this happens, e.g. for older tasks. regards -- Tomas Vondra
On 7/8/25 03:55, Cédric Villemain wrote: > Hi Andres, > >> Hi, >> >> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote: >>> In my work on more careful PostgreSQL resource management, I've come >>> to the >>> conclusion that we should avoid pushing policy too deeply into the >>> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating >>> NUMA-specific management directly into core PostgreSQL in such a way. >> >> I think it's actually the opposite - whenever we pushed stuff like this >> outside of core it has hurt postgres substantially. Not having >> replication in >> core was a huge mistake. Not having HA management in core is probably the >> biggest current adoption hurdle for postgres. >> >> To deal better with NUMA we need to improve memory placement and various >> algorithms, in an interrelated way - that's pretty much impossible to do >> outside of core. > > Except the backend pinning which is easy to achieve, thus my comment on > the related patch. > I'm not claiming NUMA memory and all should be managed outside of core > (though I didn't read other patches yet). > But an "optimal backend placement" seems to very much depend on where we placed the various pieces of shared memory. Which the external module will have trouble following, I suspect. I still don't have any idea what exactly the external module would do, or how it would decide where to place the backend. Can you describe some use case with an example? Assuming we want to actually pin tasks from within Postgres, what I think might work is allowing modules to "advise" on where to place the task. But the decision would still be done by core. regards -- Tomas Vondra
Hi, On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote: > On 7/8/25 05:04, Andres Freund wrote: > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote: > > The reason it would be advantageous to put something like the procarray onto > > smaller pages is that otherwise the entire procarray (unless particularly > > large) ends up on a single NUMA node, increasing the latency for backends on > > every other numa node and increasing memory traffic on that node. > > > > That's why the patch series splits the procarray into multiple pieces, > so that it can be properly distributed on multiple NUMA nodes even with > huge pages. It requires adjusting a couple places accessing the entries, > but it surprised me how limited the impact was. Sure, you can do that, but it does mean that iterations over the procarray now have an added level of indirection... > The thing I'm not sure about is how much this actually helps with the > traffic between node. Sure, if we pick a PGPROC from the same node, and > the task does not get moved, it'll be local traffic. But if the task > moves, there'll be traffic. I don't have any estimates how often this > happens, e.g. for older tasks. I think the most important bit is to not put everything onto one NUMA node; otherwise the increased memory contention makes increased latency for *everyone* much more likely. Greetings, Andres Freund
> On 7/8/25 03:55, Cédric Villemain wrote: >> Hi Andres, >> >>> Hi, >>> >>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote: >>>> In my work on more careful PostgreSQL resource management, I've come >>>> to the >>>> conclusion that we should avoid pushing policy too deeply into the >>>> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating >>>> NUMA-specific management directly into core PostgreSQL in such a way. >>> >>> I think it's actually the opposite - whenever we pushed stuff like this >>> outside of core it has hurt postgres substantially. Not having >>> replication in >>> core was a huge mistake. Not having HA management in core is probably the >>> biggest current adoption hurdle for postgres. >>> >>> To deal better with NUMA we need to improve memory placement and various >>> algorithms, in an interrelated way - that's pretty much impossible to do >>> outside of core. >> >> Except the backend pinning which is easy to achieve, thus my comment on >> the related patch. >> I'm not claiming NUMA memory and all should be managed outside of core >> (though I didn't read other patches yet). >> > > But an "optimal backend placement" seems to very much depend on where we > placed the various pieces of shared memory. Which the external module > will have trouble following, I suspect. > > I still don't have any idea what exactly would the external module do, > how would it decide where to place the backend. Can you describe some > use case with an example? > > Assuming we want to actually pin tasks from within Postgres, what I > think might work is allowing modules to "advise" on where to place the > task. But the decision would still be done by core. Possibly exactly what you're doing in proc.c when managing allocation of processes, but not hardcoded in PostgreSQL (patches 02, 05 and 06 are good candidates). I didn't get the impression that they require information not available to any process executing code from a module. 
Parts of your code where you assign/define policy could be in one or more relevant routines of a "numa profile manager", like in an initProcessRoutine(), and registered in a pmroutine struct:

pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
    pmroutine->init_process != NULL)
    pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to adjust when the hardware and the Linux kernel change.

-- Cédric Villemain +33 6 20 30 22 52 https://www.Data-Bene.io PostgreSQL Support, Expertise, Training, R&D
On 7/8/25 18:06, Cédric Villemain wrote: > > > > > > >> On 7/8/25 03:55, Cédric Villemain wrote: >>> Hi Andres, >>> >>>> Hi, >>>> >>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote: >>>>> In my work on more careful PostgreSQL resource management, I've come >>>>> to the >>>>> conclusion that we should avoid pushing policy too deeply into the >>>>> PostgreSQL core itself. Therefore, I'm quite skeptical about >>>>> integrating >>>>> NUMA-specific management directly into core PostgreSQL in such a way. >>>> >>>> I think it's actually the opposite - whenever we pushed stuff like this >>>> outside of core it has hurt postgres substantially. Not having >>>> replication in >>>> core was a huge mistake. Not having HA management in core is >>>> probably the >>>> biggest current adoption hurdle for postgres. >>>> >>>> To deal better with NUMA we need to improve memory placement and >>>> various >>>> algorithms, in an interrelated way - that's pretty much impossible >>>> to do >>>> outside of core. >>> >>> Except the backend pinning which is easy to achieve, thus my comment on >>> the related patch. >>> I'm not claiming NUMA memory and all should be managed outside of core >>> (though I didn't read other patches yet). >>> >> >> But an "optimal backend placement" seems to very much depend on where we >> placed the various pieces of shared memory. Which the external module >> will have trouble following, I suspect. >> >> I still don't have any idea what exactly would the external module do, >> how would it decide where to place the backend. Can you describe some >> use case with an example? >> >> Assuming we want to actually pin tasks from within Postgres, what I >> think might work is allowing modules to "advise" on where to place the >> task. But the decision would still be done by core. 
> > Possibly exactly what you're doing in proc.c when managing allocation of > process, but not hardcoded in postgresql (patches 02, 05 and 06 are good > candidates), I didn't get that they require information not available to > any process executing code from a module. > Well, it needs to understand how some other stuff (especially PGPROC entries) is distributed between nodes. I'm not sure how much of this internal information we want to expose outside core ... > Parts of your code where you assign/define policy could be in one or > more relevant routines of a "numa profile manager", like in an > initProcessRoutine(), and registered in pmroutine struct: > > pmroutine = GetPmRoutineForInitProcess(); > if (pmroutine != NULL && > pmroutine->init_process != NULL) > pmroutine->init_process(MyProc); > > This way it's easier to manage alternative policies, and also to be able > to adjust when hardware and linux kernel changes. > I'm not against making this extensible, in some way. But I still struggle to imagine a reasonable alternative policy, where the external module gets the same information and ends up with a different decision. So what would the alternate policy look like? What use case would the module be supporting? regards -- Tomas Vondra
> On 7/8/25 18:06, Cédric Villemain wrote: >> >> >> >> >> >> >>> On 7/8/25 03:55, Cédric Villemain wrote: >>>> Hi Andres, >>>> >>>>> Hi, >>>>> >>>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote: >>>>>> In my work on more careful PostgreSQL resource management, I've come >>>>>> to the >>>>>> conclusion that we should avoid pushing policy too deeply into the >>>>>> PostgreSQL core itself. Therefore, I'm quite skeptical about >>>>>> integrating >>>>>> NUMA-specific management directly into core PostgreSQL in such a way. >>>>> >>>>> I think it's actually the opposite - whenever we pushed stuff like this >>>>> outside of core it has hurt postgres substantially. Not having >>>>> replication in >>>>> core was a huge mistake. Not having HA management in core is >>>>> probably the >>>>> biggest current adoption hurdle for postgres. >>>>> >>>>> To deal better with NUMA we need to improve memory placement and >>>>> various >>>>> algorithms, in an interrelated way - that's pretty much impossible >>>>> to do >>>>> outside of core. >>>> >>>> Except the backend pinning which is easy to achieve, thus my comment on >>>> the related patch. >>>> I'm not claiming NUMA memory and all should be managed outside of core >>>> (though I didn't read other patches yet). >>>> >>> >>> But an "optimal backend placement" seems to very much depend on where we >>> placed the various pieces of shared memory. Which the external module >>> will have trouble following, I suspect. >>> >>> I still don't have any idea what exactly would the external module do, >>> how would it decide where to place the backend. Can you describe some >>> use case with an example? >>> >>> Assuming we want to actually pin tasks from within Postgres, what I >>> think might work is allowing modules to "advise" on where to place the >>> task. But the decision would still be done by core. 
>> >> Possibly exactly what you're doing in proc.c when managing allocation of >> process, but not hardcoded in postgresql (patches 02, 05 and 06 are good >> candidates), I didn't get that they require information not available to >> any process executing code from a module. >> > > Well, it needs to understand how some other stuff (especially PGPROC > entries) is distributed between nodes. I'm not sure how much of this > internal information we want to expose outside core ... > >> Parts of your code where you assign/define policy could be in one or >> more relevant routines of a "numa profile manager", like in an >> initProcessRoutine(), and registered in pmroutine struct: >> >> pmroutine = GetPmRoutineForInitProcess(); >> if (pmroutine != NULL && >> pmroutine->init_process != NULL) >> pmroutine->init_process(MyProc); >> >> This way it's easier to manage alternative policies, and also to be able >> to adjust when hardware and linux kernel changes. >> > > I'm not against making this extensible, in some way. But I still > struggle to imagine a reasonable alternative policy, where the external > module gets the same information and ends up with a different decision. > > So what would the alternate policy look like? What use case would the > module be supporting? That's the whole point: there are very distinct usages of PostgreSQL in the field. And maybe not all of them will require the policy defined by PostgreSQL core. May I ask the reverse: what prevents external modules from taking those decisions? There are already a lot of areas where external code can take over PostgreSQL processing, like Neon is doing. There is some very early processing for memory setup that I can see as a current blocker, and here I'd prefer a more complete NUMA API as proposed by Jakub, so it's possible to arrange things based on workload, hardware configuration or other matters. Reworking to get distinct segments and all as you do is great, and a combination of both approaches is probably of great interest. 
There is also the weighted interleave being discussed, and probably much more to come in this area in Linux. I think some points were raised already about possible distinct policies; I am precisely claiming that it is hard to come up with one good policy with limited setup options, hence the requirement to keep this flexible enough (hooks, an API, 100 GUCs?). There is an EPYC story here too: given that the NUMA setup can vary depending on the BIOS setup, the associated NUMA policy must probably take that into account (L3 can be either a real cache or 4 extra "local" NUMA nodes, with highly distinct access costs from a RAM module). Does that change how PostgreSQL will place memory and processes? Is it important or of interest? -- Cédric Villemain +33 6 20 30 22 52 https://www.Data-Bene.io PostgreSQL Support, Expertise, Training, R&D
Hi, On Wed, Jul 09, 2025 at 06:40:00AM +0000, Cédric Villemain wrote: > > On 7/8/25 18:06, Cédric Villemain wrote: > > I'm not against making this extensible, in some way. But I still > > struggle to imagine a reasonable alternative policy, where the external > > module gets the same information and ends up with a different decision. > > > > So what would the alternate policy look like? What use case would the > > module be supporting? > > > That's the whole point: there are very distinct usages of PostgreSQL in the > field. And maybe not all of them will require the policy defined by > PostgreSQL core. > > May I ask the reverse: what prevent external modules from taking those > decisions ? There are already a lot of area where external code can take > over PostgreSQL processing, like Neon is doing. > > There are some very early processing for memory setup that I can see as a > current blocker, and here I'd refer a more compliant NUMA api as proposed by > Jakub so it's possible to arrange based on workload, hardware configuration > or other matters. Reworking to get distinct segment and all as you do is > great, and combo of both approach probably of great interest. I think that Tomas's approach helps to have more "predictable" performance expectations, I mean more consistent over time, fewer "surprises". Your approach (and Jakub's) could help to get performance gains depending on a "known" context (so less generic). So, probably having both could make sense, but I think that they serve different purposes. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote: > > On 7/8/25 05:04, Andres Freund wrote: > > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote: > > > The reason it would be advantageous to put something like the procarray onto > > > smaller pages is that otherwise the entire procarray (unless particularly > > > large) ends up on a single NUMA node, increasing the latency for backends on > > > every other numa node and increasing memory traffic on that node. > > > Sure thing, I fully understand the motivation and underlying reason (without claiming that I understand the exact memory access patterns that involve procarray/PGPROC/etc and the hotspots involved on the PG side). Any single-liner pgbench help for how to really easily stress the PGPROC or procarray? > > That's why the patch series splits the procarray into multiple pieces, > > so that it can be properly distributed on multiple NUMA nodes even with > > huge pages. It requires adjusting a couple places accessing the entries, > > but it surprised me how limited the impact was. Yes, and we are discussing whether it is worth getting into smaller pages for such use cases (e.g. 4kB pages without hugetlb alongside 2MB hugepages, or, what would waste even more, 1GB hugetlb if we don't request 2MB for some small structs; btw, we have the ability to select MAP_HUGE_2MB vs MAP_HUGE_1GB). I'm thinking about two problems:
- 4kB pages are swappable, and mlock() potentially (?) disarms NUMA autobalancing
- using libnuma often leads to MPOL_BIND, which disarms NUMA autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2), and since the 5.12+ kernel they can take an additional flag MPOL_F_NUMA_BALANCING(!), so this looks like it has the potential to move memory anyway (if way too many tasks are relocated, so would be memory?). 
It is available only in recent libnuma as numa_set_membind_balancing(3), but sadly there's no way via libnuma to do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it would have to be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED? (select one node from many for each node while still allowing balancing?), but in [1][2] (2024) it is stated that "It's not legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", but maybe stuff has been improved since then. Something like:

PGPROC/procarray 2MB page for node#1 - mbind(addr1, MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
PGPROC/procarray 2MB page for node#2 - mbind(addr2, MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);

> Sure, you can do that, but it does mean that iterations over the procarray now > have an added level of indirection... So the most efficient would be the old way (no indirections) vs the NUMA way? Can this be done without #ifdefs at all?
We could measure: kernel 6.16+ has a per-PID numa_task_migrated in /proc/{PID}/sched, but I assume we would have to throw backends >> VCPUs at it, to simulate reality, and do some "waves" between different activity periods of certain pools. I can imagine a worst-case scenario:
a) pgbench "a" opens $VCPU connections, all idle, with a script to sleep for a while
b) pgbench "b" opens some $VCPU new connections to some other DB, all active from the start (tpcbb or read-only)
c) manually pin CPUs using taskset for each PID, all from "b", to a specific NUMA node #2 -- just to simulate an unfortunate app working on every 2nd conn
d) pgbench "a" starts working and hits CPU imbalance -- e.g. NUMA node #1 is idle, #2 is full, the CPU scheduler starts putting "a" backends on CPUs from #1, and we should notice PIDs being migrated

> I think the most important bit is to not put everything onto one numa node, > otherwise the chance of increased latency for *everyone* due to the increased > memory contention is more likely to hurt.

-J. p.s. I hope I wrote this in an understandable way, because I had many interruptions, so if anything is unclear please let me know.

[1] - https://lkml.org/lkml/2024/7/3/352
[2] - https://lkml.rescloud.iu.edu/2402.2/03227.html
Hi, On 2025-07-02 14:36:31 +0200, Tomas Vondra wrote: > On 7/2/25 13:37, Ashutosh Bapat wrote: > > On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote: > >> > >> > >> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch > >> > >> Minor optimization. Andres noticed we're tracking the tail of buffer > >> freelist, without using it. So the patch removes that. > >> > > > > The patches for resizing buffers use the lastFreeBuffer to add new > > buffers to the end of free list when expanding it. But we could as > > well add it at the beginning of the free list. Yea, I don't see any point in adding buffers to the tail instead of to the front. We probably want more recently used buffers at the front, since they (and the associated BufferDesc) are more likely to be in a CPU cache. > > This patch seems almost independent of the rest of the patches. Do you > > need it in the rest of the patches? I understand that those patches > > don't need to worry about maintaining lastFreeBuffer after this patch. > > Is there any other effect? > > > > If we are going to do this, let's do it earlier so that buffer > > resizing patches can be adjusted. > > > > My patches don't particularly rely on this bit, it would work even with > lastFreeBuffer. I believe Andres simply noticed the current code does > not use lastFreeBuffer, it just maintains is, so he removed that as an > optimization. Optimization / simplification. When building multiple freelists it was harder to maintain the tail pointer, and since it was never used... +1 to just applying that part. > I don't know how significant is the improvement, but if it's measurable we > could just do that independently of our patches. I doubt it's really an improvement in any realistic scenario, but it's also not a regression in any way, since it's never used... FWIW, I've started to wonder if we shouldn't just get rid of the freelist entirely. 
While clocksweep is perhaps minutely slower in a single thread than the freelist, clock sweep scales *considerably* better [1]. As it's rather rare to be bottlenecked on clock sweep speed for a single thread (rather than IO or memory copy overhead), I think it's worth favoring clock sweep. Also needing to switch between getting buffers from the freelist and the sweep makes the code more expensive. I think just having the buffer in the sweep, with a refcount / usagecount of zero would suffice. That seems particularly advantageous if we invest energy in making the clock sweep deal well with NUMA systems, because we don't need to have both a NUMA aware freelist and a NUMA aware clock sweep. Greetings, Andres Freund [1] A single pg_prewarm of a large relation shows a difference between using the freelist and not that's around the noise level, whereas 40 parallel pg_prewarms of separate relations is over 5x faster when disabling the freelist. For the test: - I modified pg_buffercache_evict_* to put buffers onto the freelist - Ensured all of shared buffers is allocated by querying pg_shmem_allocations_numa, as otherwise the workload is dominated by the kernel zeroing out buffers - used shared_buffers bigger than the data - data for single threaded is 9.7GB, data for the parallel case is 40 relations of 610MB each. - in the single threaded case I pinned postgres to a single core, to make sure core-to-core variation doesn't play a role

single threaded case:
c=1 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_large');") -t1

concurrent case:
c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1
On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote: > FWIW, I've started to wonder if we shouldn't just get rid of the freelist > entirely. While clocksweep is perhaps minutely slower in a single > thread than > the freelist, clock sweep scales *considerably* better [1]. As it's rather > rare to be bottlenecked on clock sweep speed for a single thread > (rather than > IO or memory copy overhead), I think it's worth favoring clock sweep. Hey Andres, thanks for spending time on this. I've worked before on freelist implementations (last one in LMDB) and I think you're onto something. I think it's an innovative idea and that the speed difference will either be lost in the noise or potentially entirely mitigated by avoiding duplicate work. > Also needing to switch between getting buffers from the freelist and > the sweep > makes the code more expensive. I think just having the buffer in the sweep, > with a refcount / usagecount of zero would suffice. If you're not already coding this, I'll jump in. :) > That seems particularly advantageous if we invest energy in making the clock > sweep deal well with NUMA systems, because we don't need to have both a NUMA > aware freelist and a NUMA aware clock sweep. 100% agree here, very clever approach adapting clock sweep to a NUMA world. best. -greg > > Greetings, > > Andres Freund
Hi, On 2025-07-09 12:04:00 +0200, Jakub Wartak wrote: > On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres@anarazel.de> wrote: > > On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote: > > > On 7/8/25 05:04, Andres Freund wrote: > > > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote: > > > > The reason it would be advantageous to put something like the procarray onto > > > > smaller pages is that otherwise the entire procarray (unless particularly > > > > large) ends up on a single NUMA node, increasing the latency for backends on > > > > every other numa node and increasing memory traffic on that node. > > > > > > Sure thing, I fully understand the motivation and underlying reason > (without claiming that I understand the exact memory access patterns > that involve procarray/PGPROC/etc and hotspots involved from PG side). > Any pgbench one-liner to really easily stress the > PGPROC or procarray? Unfortunately it's probably going to be slightly more complicated workloads that show the effect - the very simplest cases don't iterate through the procarray itself anymore. > > > That's why the patch series splits the procarray into multiple pieces, > > > so that it can be properly distributed on multiple NUMA nodes even with > > > huge pages. It requires adjusting a couple places accessing the entries, > > > but it surprised me how limited the impact was. > > Yes, and we are discussing whether it is worth using smaller pages > for such use cases (e.g. 4kB pages without hugetlb instead of 2MB huge pages, or, > wasting even more, 1GB hugetlb if we don't request 2MB for some > small structs; btw, we have the ability to select MAP_HUGE_2MB vs > MAP_HUGE_1GB). I'm thinking about two problems: > - 4kB are swappable and mlock() potentially (?) disarms NUMA autobalancing I'm not really bought into this being a problem. If your system has enough pressure to swap out the PGPROC array, you're so hosed that this won't make a difference. 
> - using libnuma often leads to MPOL_BIND which disarms NUMA > autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2) and > since 5.12+ kernel they can take additional flag > MPOL_F_NUMA_BALANCING(!), so this looks like it has potential to move > memory anyway (if way too many tasks are relocated, so would the > memory?). It is available only in recent libnuma as > numa_set_membind_balancing(3), but sadly there's no way via libnuma to > do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only? I mean it > would have to be something like MPOL_F_NUMA_BALANCING | MPOL_PREFERRED? > (select one node from many for each node while still allowing > balancing?), but in [1][2] (2024) it is stated that "It's not > legitimate (yet) to use MPOL_PREFERRED + MPOL_F_NUMA_BALANCING.", but > maybe stuff has been improved since then. > > Something like: > PGPROC/procarray 2MB page for node#1 - mbind(addr1, > MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]); > PGPROC/procarray 2MB page for node#2 - mbind(addr2, > MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]); I'm rather doubtful that it's a good idea to combine numa awareness with numa balancing. Numa balancing adds latency and makes it much more expensive for userspace to act in a numa aware way, since it needs to regularly update its knowledge about where memory resides. > > Sure, you can do that, but it does mean that iterations over the procarray now > > have an added level of indirection... > > So the most efficient would be the old-way (no indirections) vs > NUMA-way? Can this be done without #ifdefs at all? If we used 4k pages for the procarray we would just have ~4 procs on one page, if that range were marked as interleaved, it'd probably suffice. > > > The thing I'm not sure about is how much this actually helps with the > > > traffic between node. Sure, if we pick a PGPROC from the same node, and > > > the task does not get moved, it'll be local traffic. But if the task > > > moves, there'll be traffic. 
> > With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case? I doubt that NUMA balancing is going to help a whole lot here, there are too many procs on one page for that to be helpful. One thing that might be worth doing is to *increase* the size of PGPROC by moving other pieces of data that are keyed by ProcNumber into PGPROC. I think the main thing to avoid is the case where all of PGPROC, buffer mapping table, ... reside on one NUMA node (e.g. because it's the one postmaster was scheduled on), as the increased memory traffic will lead to queries on that node being slower than on the other nodes. Greetings, Andres Freund
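For reference, the mbind(2) call shape being discussed, as a minimal Linux-only sketch. The MPOL_* values below are from the uapi headers (MPOL_F_NUMA_BALANCING needs kernel >= 5.12), while build_nodemask, mbind_preferred_balancing and MAX_NUMA_NODES are made-up names for illustration; and as noted above, the kernel may simply refuse MPOL_PREFERRED combined with MPOL_F_NUMA_BALANCING.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* uapi constants from <linux/mempolicy.h> */
#define MPOL_PREFERRED        1
#define MPOL_F_NUMA_BALANCING (1 << 13)	/* kernel >= 5.12 */

/* illustration only: a single-word mask covers the first 64 nodes */
#define MAX_NUMA_NODES 64

/* Set the bits for the given node IDs in a single-word nodemask. */
static void
build_nodemask(const int *nodes, int nnodes, unsigned long *mask)
{
	*mask = 0;
	for (int i = 0; i < nnodes; i++)
		*mask |= 1UL << nodes[i];
}

/*
 * Try to mark [addr, addr + len) as preferring one node while leaving NUMA
 * balancing armed.  Returns 0 on success, errno otherwise; per the mail
 * above, the kernel may reject this mode combination outright.
 */
static int
mbind_preferred_balancing(void *addr, size_t len, int node)
{
#ifdef SYS_mbind
	unsigned long mask;

	build_nodemask(&node, 1, &mask);
	if (syscall(SYS_mbind, addr, len,
				MPOL_PREFERRED | MPOL_F_NUMA_BALANCING,
				&mask, (unsigned long) MAX_NUMA_NODES, 0) != 0)
		return errno;
	return 0;
#else
	(void) addr; (void) len; (void) node;
	return ENOSYS;
#endif
}
```

Calling this once per per-node 2MB chunk would correspond to the mbind(addr1, ...) / mbind(addr2, ...) pseudocode quoted above.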
Hi, On 2025-07-09 12:55:51 -0400, Greg Burd wrote: > On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote: > > > FWIW, I've started to wonder if we shouldn't just get rid of the freelist > > entirely. While clocksweep is perhaps minutely slower in a single > > thread than > > the freelist, clock sweep scales *considerably* better [1]. As it's rather > > rare to be bottlenecked on clock sweep speed for a single thread > > (rather than > > IO or memory copy overhead), I think it's worth favoring clock sweep. > > Hey Andres, thanks for spending time on this. I've worked before on > freelist implementations (last one in LMDB) and I think you're onto > something. I think it's an innovative idea and that the speed > difference will either be lost in the noise or potentially entirely > mitigated by avoiding duplicate work. Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE perform better because it doesn't need to maintain the freelist anymore... > > Also needing to switch between getting buffers from the freelist and > > the sweep > > makes the code more expensive. I think just having the buffer in the sweep, > > with a refcount / usagecount of zero would suffice. > > If you're not already coding this, I'll jump in. :) My experimental patch is literally a four character addition ;), namely adding "0 &&" to the relevant code in StrategyGetBuffer(). Obviously a real patch would need to do some more work than that. Feel free to take on that project, I am not planning on tackling that in the near term. There are other things around this that could use some attention. It's not hard to see clock sweep be a bottleneck in concurrent workloads - partially due to the shared maintenance of the clock hand. A NUMAed clock sweep would address that. However, we also maintain StrategyControl->numBufferAllocs, which is a significant contention point and would not necessarily be removed by a NUMAification of the clock sweep. Greetings, Andres Freund
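If numBufferAllocs stays a single shared atomic, one standard way to relieve that kind of contention point is striping the counter across cacheline-padded partitions and summing on read. A sketch under assumed names (nothing here is existing PostgreSQL code; the real counter is read by the bgwriter, which can tolerate a slightly stale sum):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical partition count, e.g. one slot per NUMA node or per
 * clock-sweep partition. */
#define NUM_ALLOC_PARTITIONS 16

/* Pad each counter to its own cacheline so writers don't false-share. */
typedef struct PaddedAllocCounter
{
	_Atomic uint64_t count;
	char		pad[64 - sizeof(_Atomic uint64_t)];
} PaddedAllocCounter;

static PaddedAllocCounter alloc_counters[NUM_ALLOC_PARTITIONS];

/* Writers touch only their own partition's cacheline. */
static inline void
count_buffer_alloc(int partition)
{
	atomic_fetch_add_explicit(&alloc_counters[partition].count, 1,
							  memory_order_relaxed);
}

/* Readers sum the stripes; a slightly stale total is fine for statistics. */
static inline uint64_t
total_buffer_allocs(void)
{
	uint64_t	sum = 0;

	for (int i = 0; i < NUM_ALLOC_PARTITIONS; i++)
		sum += atomic_load_explicit(&alloc_counters[i].count,
									memory_order_relaxed);
	return sum;
}
```

The trade-off is the usual one: hot-path increments become contention-free, while the (rare) read has to walk all partitions.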
Hi, On 2025-07-08 16:06:00 +0000, Cédric Villemain wrote: > > Assuming we want to actually pin tasks from within Postgres, what I > > think might work is allowing modules to "advise" on where to place the > > task. But the decision would still be done by core. > > Possibly exactly what you're doing in proc.c when managing allocation of > processes, but not hardcoded in postgresql (patches 02, 05 and 06 are good > candidates), I didn't get that they require information not available to any > process executing code from a module. > Parts of your code where you assign/define policy could be in one or more > relevant routines of a "numa profile manager", like in an > initProcessRoutine(), and registered in pmroutine struct: > > pmroutine = GetPmRoutineForInitProcess(); > if (pmroutine != NULL && > pmroutine->init_process != NULL) > pmroutine->init_process(MyProc); > > This way it's easier to manage alternative policies, and also to be able to > adjust when hardware and the Linux kernel change. I am doubtful this makes sense - as you can see patch 05 needs to change a fair bit of core code to make this work, there's no way we can delegate much of that to an extension. But even if it's doable, I think it's *very* premature to focus on such extensibility at this point - we need to get the basics into a mergeable state; if you then want to argue for adding extensibility, we can have that discussion at that point. Trying to design this for extensibility from the get go, where that extensibility is very unlikely to be used widely, seems rather likely to just tank this entire project without getting us anything in return. Greetings, Andres Freund
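For concreteness, the hook shape sketched in the quoted mail could look roughly like this; every name here (ProcStub, PmRoutine, RegisterPmRoutine, InitProcessNumaPolicy) is hypothetical, just mirroring the pmroutine snippet above, not anything in PostgreSQL:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PGPROC; the real struct is far larger. */
typedef struct ProcStub
{
	int			pid;
	int			numa_node;
} ProcStub;

/* Hook table in the style of the quoted pmroutine struct. */
typedef struct PmRoutine
{
	void		(*init_process) (ProcStub *proc);
} PmRoutine;

static PmRoutine *active_pmroutine = NULL;

/* A module would call this from its _PG_init equivalent. */
static void
RegisterPmRoutine(PmRoutine *pmroutine)
{
	active_pmroutine = pmroutine;
}

/* Core call site: delegate placement policy if a module registered one. */
static void
InitProcessNumaPolicy(ProcStub *proc)
{
	if (active_pmroutine != NULL && active_pmroutine->init_process != NULL)
		active_pmroutine->init_process(proc);
}

/* A toy policy a module might register: spread backends over two nodes. */
static void
sample_init_process(ProcStub *proc)
{
	proc->numa_node = proc->pid % 2;
}

static PmRoutine sample_pmroutine = {sample_init_process};
```

As the reply argues, though, the hard part is not the hook itself but the amount of core code the policy would need to reach into.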
Hi, Thanks for working on this! I think it's an area we have long neglected... On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote: > Each patch has a numa_ GUC, intended to enable/disable that part. This > is meant to make development easier, not as a final interface. I'm not > sure how exactly that should look. It's possible some combinations of > GUCs won't work, etc. Wonder if some of it might be worth putting into a multi-valued GUC (like debug_io_direct). > 1) v1-0001-NUMA-interleaving-buffers.patch > > This is the main thing when people think about NUMA - making sure the > shared buffers are allocated evenly on all the nodes, not just on a > single node (which can happen easily with warmup). The regular memory > interleaving would address this, but it also has some disadvantages. > > Firstly, it's oblivious to the contents of the shared memory segment, > and we may not want to interleave everything. It's also oblivious to > alignment of the items (a buffer can easily end up "split" on multiple > NUMA nodes), or relationship between different parts (e.g. there's a > BufferBlock and a related BufferDescriptor, and those might again end up > on different nodes). Two more disadvantages: With OS interleaving postgres doesn't (not easily at least) know about what maps to what, which means postgres can't do stuff like numa aware buffer replacement. With OS interleaving the interleaving is "too fine grained", with pages being mapped at each page boundary, making it less likely for things like one strategy ringbuffer to reside on a single numa node. I wonder if we should *increase* the size of shared_buffers whenever huge pages are in use and there's padding space due to the huge page boundaries. Pretty pointless to waste that memory if we can instead use it for the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge pages... 
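To make the padding argument concrete, here is a toy calculation of how many whole buffers would fit in the huge-page rounding slack; the helper names are made up, BLCKSZ is fixed at the default 8192, and the real sizing logic is of course more involved:

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

/* Round a shared-memory request up to the next huge page boundary. */
static inline uint64_t
round_up_to_page(uint64_t size, uint64_t huge_page_size)
{
	return (size + huge_page_size - 1) / huge_page_size * huge_page_size;
}

/*
 * How many whole buffers fit in the rounding slack that would otherwise be
 * wasted?  With 2MB pages this is at most 255 buffers; with 1GB pages it
 * can be tens of thousands.
 */
static inline uint64_t
extra_buffers_in_padding(uint64_t requested, uint64_t huge_page_size)
{
	return (round_up_to_page(requested, huge_page_size) - requested) / BLCKSZ;
}
```

E.g. a 3.5GB request on 1GB huge pages gets rounded up to 4GB, leaving 512MB of slack, which is 65536 buffers that could be handed to the buffer pool instead of wasted.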
> 4) v1-0004-NUMA-partition-buffer-freelist.patch > > Right now we have a single freelist, and in busy instances that can be > quite contended. What's worse, the freelist may trash between different > CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on > subsets of buffers. The patch implements multiple strategies how the > list can be split (configured using "numa_partition_freelist" GUC), for > experimenting: > > * node - One list per NUMA node. This is the most natural option, > because we now know which buffer is on which node, so we can ensure a > list for a node only has buffers from that node. > > * cpu - One list per CPU. Pretty simple, each CPU gets its own list. > > * pid - Similar to "cpu", but the processes are mapped to lists based on > PID, not CPU ID. > > * none - nothing, single freelist > > Ultimately, I think we'll want to go with "node", simply because it > aligns with the buffer interleaving. But there are improvements needed. I think we might eventually want something more granular than just "node" - the freelist (and the clock sweep) can become a contention point even within one NUMA node. I'm imagining something like an array of freelists/clocksweep states, where the current numa node selects a subset of the array and the cpu is used to choose the entry within that list. But we can do that later, that should be a fairly simple extension of what you're doing. > The other missing part is clocksweep - there's still just a single > instance of clocksweep, feeding buffers to all the freelists. But that's > clearly a problem, because the clocksweep returns buffers from all NUMA > nodes. The clocksweep really needs to be partitioned the same way as a > freelists, and each partition will operate on a subset of buffers (from > the right NUMA node). > > I do have a separate experimental patch doing something like that, I > need to make it part of this branch. 
I'm really curious about that patch, as I wrote elsewhere in this thread, I think we should just get rid of the freelist altogether. Even if we don't do so, in a steady state system the clock sweep is commonly much more important than the freelist... > 5) v1-0005-NUMA-interleave-PGPROC-entries.patch > > Another area that seems like it might benefit from NUMA is PGPROC, so I > gave it a try. It turned out somewhat challenging. Similarly to buffers > we have two pieces that need to be located in a coordinated way - PGPROC > entries and fast-path arrays. But we can't use the same approach as for > buffers/descriptors, because > > (a) Neither of those pieces aligns with memory page size (PGPROC is > ~900B, fast-path arrays are variable length). > (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require > rather high max_connections before we use multiple huge pages. We should probably pad them regardless? Right now sizeof(PGPROC) happens to be a multiple of 64 (i.e. the most common cache line size), but that hasn't always been the case, and isn't the case on systems with 128 byte cachelines like common ARMv8 systems. And having one cacheline hold one backend's fast-path state and another backend's xmin doesn't sound like a recipe for good performance. Seems like we should also do some reordering of the contents within PGPROC. We e.g. have very frequently changing data (->waitStatus, ->lwWaiting) in the same cacheline as almost immutable data (->pid, ->pgxactoff, ->databaseId). > So what I did instead is splitting the whole PGPROC array into one array > per NUMA node, and one array for auxiliary processes and 2PC xacts. So > with 4 NUMA nodes there are 5 separate arrays, for example. Each array > is a multiple of memory pages, so we may waste some of the memory. But > that's simply how NUMA works - page granularity. 
Theoretically we could use the "padding" memory at the end of each NUMA node's PGPROC array for the 2PC entries, as for those we presumably don't care about locality. Not sure it's worth the complexity though. For a while I thought I had a better solution: Given that we're going to waste all the "padding" memory, why not just oversize the PGPROC array so that it spans the required number of NUMA nodes? The problem is that that would cause ProcNumbers to get much larger, and we do have other arrays that are keyed by ProcNumber. Which probably makes this not so great an idea. > This however makes one particular thing harder - in a couple places we > accessed PGPROC entries through PROC_HDR->allProcs, which was pretty > much just one large array. And GetNumberFromPGProc() relied on array > arithmetics to determine procnumber. With the array partitioned, this > can't work the same way. > > But there's a simple solution - if we turn allProcs into an array of > *pointers* to PGPROC arrays, there's no issue. All the places need a > pointer anyway. And then we need an explicit procnumber field in PGPROC, > instead of calculating it. > > There's a chance this has a negative impact on code that accesses PGPROC > very often, but so far I haven't seen such cases. But if you can come up > with such examples, I'd like to see those. I'd not be surprised if there were overhead, adding a level of indirection to things like ProcArrayGroupClearXid(), GetVirtualXIDsDelayingChkpt(), SignalVirtualTransaction() probably won't be free. BUT: For at least some of these a better answer might be to add additional "dense" arrays like we have for xids etc, so they don't need to trawl through PGPROCs. > There's another detail - when obtaining a PGPROC entry in InitProcess(), > we try to get an entry from the same NUMA node. And only if that doesn't > work, we grab the first one from the list (there's still just one PGPROC > freelist, I haven't split that - maybe we should?). 
I guess it might be worth partitioning the freelist - iterating through a few thousand links just to discover that there's no free proc on the current numa node, while holding a spinlock, doesn't sound great. Even if it's likely rarely a huge issue compared to other costs. > The other thing I haven't thought about very much is determining on > which CPUs/nodes the instance is allowed to run. I assume we'd start by > simply inheriting/determining that at the start through libnuma, not through > some custom PG configuration (which the patch [2] proposed to do). That seems like the right thing to me. One thing that this patchset afaict doesn't address so far is that there is a fair bit of other important shared memory that this patch doesn't set up intelligently, e.g. the buffer mapping table itself (but there are loads of other cases). Because we touch a lot of that memory during startup, most of it will be allocated on whatever NUMA node postmaster was scheduled on. I suspect that the best we can do for parts of shared memory where we don't have explicit NUMA awareness is to default to an interleave policy. > From 9712e50d6d15c18ea2c5fcf457972486b0d4ef53 Mon Sep 17 00:00:00 2001 > From: Tomas Vondra <tomas@vondra.me> > Date: Tue, 6 May 2025 21:12:21 +0200 > Subject: [PATCH v1 1/6] NUMA: interleaving buffers > > Ensure shared buffers are allocated from all NUMA nodes, in a balanced > way, instead of just using the node where Postgres initially starts, or > where the kernel decides to migrate the page, etc. With pre-warming > performed by a single backend, this can easily result in severely > unbalanced memory distribution (with most from a single NUMA node). > > The kernel would eventually move some of the memory to other nodes > (thanks to zone_reclaim), but that tends to take a long time. So this > patch improves predictability, reduces the time needed for warmup > during benchmarking, etc. It's less dependent on what the CPU > scheduler does, etc. 
FWIW, I don't think zone_reclaim_mode will commonly do that? Even if enabled, which I don't think it is anymore by default. At least huge pages can't be reclaimed by the kernel, but even when not using huge pages, I think the only scenario where that would happen is if shared_buffers were swapped out. Numa balancing might eventually "fix" such an imbalance though. > diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c > index ed1dc488a42..2ad34624c49 100644 > --- a/src/backend/storage/buffer/buf_init.c > +++ b/src/backend/storage/buffer/buf_init.c > @@ -14,9 +14,17 @@ > */ > #include "postgres.h" > > +#ifdef USE_LIBNUMA > +#include <numa.h> > +#include <numaif.h> > +#endif > + I wonder how much of this we should try to put into port/pg_numa.c. Having direct calls to libnuma code all over the backend will make it rather hard to add numa awareness for hypothetical platforms not using libnuma compatible interfaces. > +/* number of buffers allocated on the same NUMA node */ > +static int64 numa_chunk_buffers = -1; Given that NBuffers is a 32bit quantity, this probably doesn't need to be 64bit... Anyway, I'm not going to review on that level going forward, the patch is probably in too early a state for that. > @@ -71,18 +92,80 @@ BufferManagerShmemInit(void) > foundDescs, > foundIOCV, > foundBufCkpt; > + Size mem_page_size; > + Size buffer_align; > + > + /* > + * XXX A bit weird. Do we need to worry about postmaster? Could this even > + * run outside postmaster? I don't think so. It can run in single user mode - but that shouldn't prevent us from using pg_get_shmem_pagesize(). > + * XXX Another issue is we may get different values than when sizing the > + * the memory, because at that point we didn't know if we get huge pages, > + * so we assumed we will. Shouldn't cause crashes, but we might allocate > + * shared memory and then not use some of it (because of the alignment > + * that we don't actually need). 
> + * Not sure about better way, good for now. > + */ Ugh, not seeing a great way to deal with that either. > + * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to > + * align to mem_page_size? Especially for very large huge pages (e.g. 1GB) > + * that doesn't seem quite worth it. Maybe we should simply align to > + * BLCKSZ, so that buffers don't get split? Still, we might interfere with > + * other stuff stored in shared memory that we want to allocate on a > + * particular NUMA node (e.g. ProcArray). > + * > + * XXX Maybe with "too large" huge pages we should just not do this, or > + * maybe do this only for sufficiently large areas (e.g. shared buffers, > + * but not ProcArray). I think that's right - there's no point in using 1GB pages for anything other than shared_buffers, we should allocate shared_buffers separately. > +/* > + * Determine the size of memory page. > + * > + * XXX This is a bit tricky, because the result depends at which point we call > + * this. Before the allocation we don't know if we succeed in allocating huge > + * pages - but we have to size everything for the chance that we will. And then > + * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory > + * pages. But at that point we can't adjust the sizing. > + * > + * XXX Maybe with huge_pages=try we should do the sizing twice - first with > + * huge pages, and if that fails, then without them. But not for this patch. > + * Up to this point there was no such dependency on huge pages. Doing it twice sounds somewhat nasty - but perhaps we could just have the shmem size infrastructure compute two different numbers, one for use with huge pages and one without? 
> +static int64 > +choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes) > +{ > + int64 num_items; > + int64 max_items; > + > + /* make sure the chunks will align nicely */ > + Assert(BLCKSZ % sizeof(BufferDescPadded) == 0); > + Assert(mem_page_size % sizeof(BufferDescPadded) == 0); > + Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0)); > + > + /* > + * The minimum number of items to fill a memory page with descriptors and > + * blocks. The NUMA allocates memory in pages, and we need to do that for > + * both buffers and descriptors. > + * > + * In practice the BLCKSZ doesn't really matter, because it's much larger > + * than BufferDescPadded, so the result is determined buffer descriptors. > + * But it's clearer this way. > + */ > + num_items = Max(mem_page_size / sizeof(BufferDescPadded), > + mem_page_size / BLCKSZ); > + > + /* > + * We shouldn't use chunks larger than NBuffers/num_nodes, because with > + * larger chunks the last NUMA node would end up with much less memory (or > + * no memory at all). > + */ > + max_items = (NBuffers / num_nodes); > + > + /* > + * Did we already exceed the maximum desirable chunk size? That is, will > + * the last node get less than one whole chunk (or no memory at all)? > + */ > + if (num_items > max_items) > + elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)", > + num_items, max_items); > + > + /* grow the chunk size until we hit the max limit. 
*/ > + while (2 * num_items <= max_items) > + num_items *= 2; Something around this logic leads to a fair bit of imbalance - I started postgres with huge_page_size=1GB, shared_buffers=4GB on a 2 node system and that results in postgres[4188255][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors'); ┌────────────────────┬───────────┬────────────┐ │ name │ numa_node │ size │ ├────────────────────┼───────────┼────────────┤ │ Buffer Blocks │ 0 │ 5368709120 │ │ Buffer Blocks │ 1 │ 1073741824 │ │ Buffer Descriptors │ 0 │ 1073741824 │ │ Buffer Descriptors │ 1 │ 1073741824 │ └────────────────────┴───────────┴────────────┘ (4 rows) With shared_buffers=8GB postgres failed to start, even though 16 1GB huge pages are available, as 18GB were requested. After increasing the limit, the top allocations were as follows: postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations ORDER BY allocated_size DESC LIMIT 5; ┌──────────────────────┬─────────────┬────────────┬────────────────┐ │ name │ off │ size │ allocated_size │ ├──────────────────────┼─────────────┼────────────┼────────────────┤ │ Buffer Blocks │ 1192223104 │ 9663676416 │ 9663676416 │ │ PGPROC structures │ 10970279808 │ 3221733342 │ 3221733376 │ │ Fast-Path Lock Array │ 14192013184 │ 3221396544 │ 3221396608 │ │ Buffer Descriptors │ 51372416 │ 1140850688 │ 1140850688 │ │ (null) │ 17468590976 │ 785020032 │ 785020032 │ └──────────────────────┴─────────────┴────────────┴────────────────┘ With a fair bit of imbalance: postgres[4189384][1]=# SELECT * FROM pg_shmem_allocations_numa WHERE name in ('Buffer Blocks', 'Buffer Descriptors'); ┌────────────────────┬───────────┬────────────┐ │ name │ numa_node │ size │ ├────────────────────┼───────────┼────────────┤ │ Buffer Blocks │ 0 │ 8589934592 │ │ Buffer Blocks │ 1 │ 2147483648 │ │ Buffer Descriptors │ 0 │ 0 │ │ Buffer Descriptors │ 1 │ 2147483648 │ └────────────────────┴───────────┴────────────┘ (4 rows) Note that the buffer descriptors 
are all on node 1. > +/* > + * Calculate the NUMA node for a given buffer. > + */ > +int > +BufferGetNode(Buffer buffer) > +{ > + /* not NUMA interleaving */ > + if (numa_chunk_buffers == -1) > + return -1; > + > + return (buffer / numa_chunk_buffers) % numa_nodes; > +} FWIW, this is likely rather expensive - when not a compile time constant, divisions and modulo can take a fair number of cycles. > +/* > + * pg_numa_interleave_memory > + * move memory to different NUMA nodes in larger chunks > + * > + * startptr - start of the region (should be aligned to page size) > + * endptr - end of the region (doesn't need to be aligned) > + * mem_page_size - size of the memory page size > + * chunk_size - size of the chunk to move to a single node (should be multiple > + * of page size > + * num_nodes - number of nodes to allocate memory to > + * > + * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead? > + * That might be more efficient than numa_move_pages, as it works on larger > + * chunks of memory, not individual system pages, I think. > + * > + * XXX The "interleave" name is not quite accurate, I guess. > + */ > +static void > +pg_numa_interleave_memory(char *startptr, char *endptr, > + Size mem_page_size, Size chunk_size, > + int num_nodes) > +{ Seems like this should be in pg_numa.c? > diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c > index 69b6a877dc9..c07de903f76 100644 > --- a/src/bin/pgbench/pgbench.c > +++ b/src/bin/pgbench/pgbench.c I assume those changes weren't intentionally part of this patch... > From 6505848ac8359c8c76dfbffc7150b6601ab07601 Mon Sep 17 00:00:00 2001 > From: Tomas Vondra <tomas@vondra.me> > Date: Thu, 22 May 2025 18:38:41 +0200 > Subject: [PATCH v1 4/6] NUMA: partition buffer freelist > > Instead of a single buffer freelist, partition into multiple smaller > lists, to reduce lock contention, and to spread the buffers over all > NUMA nodes more evenly. 
> > There are four strategies, specified by GUC numa_partition_freelist > > * none - single long freelist, should work just like now > > * node - one freelist per NUMA node, with only buffers from that node > > * cpu - one freelist per CPU > > * pid - freelist determined by PID (same number of freelists as 'cpu') > > When allocating a buffer, it's taken from the correct freelist (e.g. > same NUMA node). > > Note: This is (probably) more important than partitioning ProcArray. > > +/* > + * Represents one freelist partition. > + */ > +typedef struct BufferStrategyFreelist > +{ > + /* Spinlock: protects the values below */ > + slock_t freelist_lock; > + > + /* > + * XXX Not sure why this needs to be aligned like this. Need to ask > + * Andres. > + */ > + int firstFreeBuffer __attribute__((aligned(64))); /* Head of list of > + * unused buffers */ > + > + /* Number of buffers consumed from this list. */ > + uint64 consumed; > +} BufferStrategyFreelist; I think this might be a leftover from measuring performance of a *non* partitioned freelist. I saw unnecessary contention between BufferStrategyControl->{nextVictimBuffer,buffer_strategy_lock,numBufferAllocs} and was testing what effect the simplest avoidance scheme has. I don't think this should be part of this patchset. > > /* > * The shared freelist control information. > @@ -39,8 +66,6 @@ typedef struct > */ > pg_atomic_uint32 nextVictimBuffer; > > - int firstFreeBuffer; /* Head of list of unused buffers */ > > /* > * Statistics. These counters should be wide enough that they can't > * overflow during a single bgwriter cycle. > @@ -51,13 +76,27 @@ typedef struct > /* > * Bgworker process to be notified upon activity or -1 if none. See > * StrategyNotifyBgWriter. > + * > + * XXX Not sure why this needs to be aligned like this. Need to ask > + * Andres. Also, shouldn't the alignment be specified after, like for > + * "consumed"? 
> */ > - int bgwprocno; > + int __attribute__((aligned(64))) bgwprocno; > + > + BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER]; > } BufferStrategyControl; Here the reason was that it's silly to put almost-readonly data (like bgwprocno) onto the same cacheline as very frequently modified data like ->numBufferAllocs. That causes unnecessary cache misses in many StrategyGetBuffer() calls, as another backend's StrategyGetBuffer() will always have modified ->numBufferAllocs and either ->buffer_strategy_lock or ->nextVictimBuffer. > > +static BufferStrategyFreelist * > +ChooseFreeList(void) > +{ > + unsigned cpu; > + unsigned node; > + int rc; > + > + int freelist_idx; > + > + /* freelist not partitioned, return the first (and only) freelist */ > + if (numa_partition_freelist == FREELIST_PARTITION_NONE) > + return &StrategyControl->freelists[0]; > + > + /* > + * freelist is partitioned, so determine the CPU/NUMA node, and pick a > + * list based on that. > + */ > + rc = getcpu(&cpu, &node); > + if (rc != 0) > + elog(ERROR, "getcpu failed: %m"); Probably should put this into somewhere abstracted away... > + /* > + * Pick the freelist, based on CPU, NUMA node or process PID. This matches > + * how we built the freelists above. > + * > + * XXX Can we rely on some of the values (especially strategy_nnodes) to > + * be a power-of-2? Then we could replace the modulo with a mask, which is > + * likely more efficient. > + */ > + switch (numa_partition_freelist) > + { > + case FREELIST_PARTITION_CPU: > + freelist_idx = cpu % strategy_ncpus; As mentioned earlier, modulo is rather expensive for something executed so frequently... > + break; > + > + case FREELIST_PARTITION_NODE: > + freelist_idx = node % strategy_nnodes; > + break; Here we shouldn't need modulo, right? 
> + > + case FREELIST_PARTITION_PID: > + freelist_idx = MyProcPid % strategy_ncpus; > + break; > + > + default: > + elog(ERROR, "unknown freelist partitioning value"); > + } > + > + return &StrategyControl->freelists[freelist_idx]; > +} > /* size of lookup hash table ... see comment in StrategyInitialize */ > size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS)); > > /* size of the shared replacement strategy control block */ > - size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl))); > + size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists))); > + > + /* > + * Allocate one freelist per CPU. We might use per-node freelists, but the > + * assumption is the number of NUMA nodes is less than the number of CPUs. > + * > + * FIXME This assumes that we have more CPUs than NUMA nodes, which seems > + * like a safe assumption. But maybe we should calculate how many elements > + * we actually need, depending on the GUC? Not a huge amount of memory. FWIW, I don't think that's a safe assumption anymore. With CXL we can get a) PCIe attached memory and b) remote memory as separate NUMA nodes, and that very well could end up as more NUMA nodes than cores. Ugh, -ETOOLONG. Gotta schedule some other things... Greetings, Andres Freund
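The cost concern about division and modulo, and the XXX question about power-of-two counts, can be made concrete. When the chunk size and node count are both powers of two, the div/mod in BufferGetNode() collapses to a shift and a mask. A minimal sketch with made-up constants (a real implementation would compute the shift once at startup):

```c
#include <assert.h>

/* Hypothetical stand-ins for numa_chunk_buffers / numa_nodes;
 * both are assumed to be powers of two here. */
#define CHUNK_BUFFERS      1024
#define CHUNK_BUFFERS_LOG2 10
#define NUM_NODES          4

/* Division/modulo form, as in the quoted BufferGetNode(). */
static int
buffer_node_divmod(int buffer)
{
	return (buffer / CHUNK_BUFFERS) % NUM_NODES;
}

/* Shift/mask form: valid only for power-of-two sizes, but avoids the
 * expensive div/mod instructions on the hot path. */
static int
buffer_node_mask(int buffer)
{
	return (buffer >> CHUNK_BUFFERS_LOG2) & (NUM_NODES - 1);
}
```

For non-negative buffer IDs the two forms agree whenever both constants are powers of two; the trade-off is that the number of NUMA nodes is rarely a power of two in practice, which is why the patch cannot simply assume this.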
On Wed, Jul 9, 2025 at 7:13 PM Andres Freund <andres@anarazel.de> wrote: > > Yes, and we are discussing if it is worth getting into smaller pages > > for such usecases (e.g. 4kB ones without hugetlb with 2MB hugepages or > > what more even more waste 1GB hugetlb if we dont request 2MB for some > > small structs: btw, we have ability to select MAP_HUGE_2MB vs > > MAP_HUGE_1GB). I'm thinking about two problems: > > - 4kB are swappable and mlock() potentially (?) disarms NUMA autobalancing > > I'm not really bought into this being a problem. If your system has enough > pressure to swap out the PGPROC array, you're so hosed that this won't make a > difference. OK I need to bend here, yet still part of me believes that the situation where we have hugepages (for 'Buffer Blocks') and yet some smaller, but way more critical, structs are more likely to be swapped out due to pressure of some backend-gone-wild random mallocs() is unhealthy to me (especially since the OS might prefer swapping based on the per-node rather than the global picture) > I'm rather doubtful that it's a good idea to combine numa awareness with numa > balancing. Numa balancing adds latency and makes it much more expensive for > userspace to act in a numa aware way, since it needs to regularly update its > knowledge about where memory resides. Well the problem is that backends get migrated to random CPUs often (migrated++ on very high backend counts and non-uniform workloads in terms of backend-CPU usage), but the autobalancing doesn't need to be on or off for everything. It could be autobalancing just a certain memory region, without affecting the app in any way (well, other than those minor page faults, literally). > If we used 4k pages for the procarray we would just have ~4 procs on one page, > if that range were marked as interleaved, it'd probably suffice. OK, this sounds like the best and simplest proposal to me, yet the patch doesn't do OS-based interleaving for those today. 
Gonna try that mlock() sooner or later... ;) -J.
On Wed, Jul 9, 2025 at 9:42 PM Andres Freund <andres@anarazel.de> wrote: > On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote: > > Each patch has a numa_ GUC, intended to enable/disable that part. This > > is meant to make development easier, not as a final interface. I'm not > > sure how exactly that should look. It's possible some combinations of > > GUCs won't work, etc. > > Wonder if some of it might be worth putting into a multi-valued GUC (like > debug_io_direct). Long-term or for experimentation? Also please see below as it is related: [..] > FWIW, I don't think that's a safe assumption anymore. With CXL we can get a) > PCIe attached memory and b) remote memory as separate NUMA nodes, and that > very well could end up as more NUMA nodes than cores. In my earlier, apparently way too naive, approach I've tried to handle this CXL scenario, but I'm afraid this cannot be done without further configuration; please see the review/use cases in [1] and [2]. -J. [1] https://www.postgresql.org/message-id/attachment/178119/v4-0001-Add-capability-to-interleave-shared-memory-across.patch - just see sgml/GUC and we have numa_parse_nodestring(3) [2] https://www.postgresql.org/message-id/aAKPMrX1Uq6quKJy%40ip-10-97-1-34.eu-west-3.compute.internal
> On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-07-09 12:55:51 -0400, Greg Burd wrote: >> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote: >> >>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist >>> entirely. While clocksweep is perhaps minutely slower in a single >>> thread than >>> the freelist, clock sweep scales *considerably* better [1]. As it's rather >>> rare to be bottlenecked on clock sweep speed for a single thread >>> (rather then >>> IO or memory copy overhead), I think it's worth favoring clock sweep. >> >> Hey Andres, thanks for spending time on this. I've worked before on >> freelist implementations (last one in LMDB) and I think you're onto >> something. I think it's an innovative idea and that the speed >> difference will either be lost in the noise or potentially entirely >> mitigated by avoiding duplicate work. > > Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE > perform better because it doesn't need to maintain the freelist anymore... > > >>> Also needing to switch between getting buffers from the freelist and >>> the sweep >>> makes the code more expensive. I think just having the buffer in the sweep, >>> with a refcount / usagecount of zero would suffice. >> >> If you're not already coding this, I'll jump in. :) > > My experimental patch is literally a four character addition ;), namely adding > "0 &&" to the relevant code in StrategyGetBuffer(). > > Obviously a real patch would need to do some more work than that. Feel free > to take on that project, I am not planning on tackling that in near term. > I started on this last night, making good progress. Thanks for the inspiration. I'll create a new thread to track the work and cross-reference when I have something reasonable to show (hopefully later today). > There's other things around this that could use some attention. 
> It's not hard > to see clock sweep be a bottleneck in concurrent workloads - partially due to > the shared maintenance of the clock hand. A NUMAed clock sweep would address > that. Working on it. Other than NUMA-fying the clocksweep, there is a function have_free_buffer() that might be a tad tricky to re-implement efficiently and/or make NUMA aware. Or maybe I can remove that too? It is used in autoprewarm.c and possibly other extensions, but nowhere else in core. > However, we also maintain StrategyControl->numBufferAllocs, which is a > significant contention point and would not necessarily be removed by a > NUMAificiation of the clock sweep. Yep, I noted this counter and its potential for contention too. Fortunately, it seems like it is only used so that "bgwriter can estimate the rate of buffer consumption" which to me opens the door to a less accurate partitioned counter, perhaps something lock-free (no mutex/CAS) that is bucketed then combined when read. A quick look at bufmgr.c indicates that recent_allocs (which is StrategyControl->numBufferAllocs) is used to track a "moving average" and other voodoo there I've yet to fully grok. Any thoughts on this approximate count approach? Also, what are your thoughts on updating the algorithm to CLOCK-Pro [1] while I'm there? I guess I'd have to try it out, measure it a lot and see if there are any material benefits. Maybe I'll keep that for a future patch, or at least layer it... back to work! > Greetings, > > Andres Freund best. -greg [1] https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/jiang/jiang_html/html.html
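The "less accurate partitioned counter" idea sketched in the message above could look something like this: per-partition slots padded to a cache line, bumped with relaxed atomics on the hot path, and summed only when the total is read. All names here are invented for illustration, not taken from any patch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_COUNTER_PARTS 8
#define CACHE_LINE_SIZE   64

/* One counter slot per partition, padded so two slots never share a
 * cache line (otherwise the striping buys nothing). */
typedef struct CounterSlot
{
	_Alignas(CACHE_LINE_SIZE) atomic_uint_fast64_t count;
} CounterSlot;

static CounterSlot buffer_allocs[NUM_COUNTER_PARTS];

/* Hot path: each backend increments only "its" slot, with no ordering
 * requirements, so there is no cross-partition contention. */
static inline void
count_buffer_alloc(int partition)
{
	atomic_fetch_add_explicit(&buffer_allocs[partition].count, 1,
							  memory_order_relaxed);
}

/* Cold path (e.g. bgwriter): sum the slots; the result may be slightly
 * stale, which is fine for estimating an allocation rate. */
static uint64_t
total_buffer_allocs(void)
{
	uint64_t	total = 0;

	for (int i = 0; i < NUM_COUNTER_PARTS; i++)
		total += atomic_load_explicit(&buffer_allocs[i].count,
									  memory_order_relaxed);
	return total;
}
```

Since the sum is taken without any global synchronization it is only approximate under concurrent updates, but never loses increments - which seems to match the "bgwriter can estimate the rate" requirement.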
Hi, On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote: > Hi, > > Thanks for working on this! Indeed, thanks! > On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote: > > 1) v1-0001-NUMA-interleaving-buffers.patch > > > > This is the main thing when people think about NUMA - making sure the > > shared buffers are allocated evenly on all the nodes, not just on a > > single node (which can happen easily with warmup). The regular memory > > interleaving would address this, but it also has some disadvantages. > > > > Firstly, it's oblivious to the contents of the shared memory segment, > > and we may not want to interleave everything. It's also oblivious to > > alignment of the items (a buffer can easily end up "split" on multiple > > NUMA nodes), or relationship between different parts (e.g. there's a > > BufferBlock and a related BufferDescriptor, and those might again end up > > on different nodes). > > Two more disadvantages: > > With OS interleaving postgres doesn't (not easily at least) know about what > maps to what, which means postgres can't do stuff like numa aware buffer > replacement. > > With OS interleaving the interleaving is "too fine grained", with pages being > mapped at each page boundary, making it less likely for things like one > strategy ringbuffer to reside on a single numa node. > > There's a secondary benefit of explicitly assigning buffers to nodes, > > using this simple scheme - it allows quickly determining the node ID > > given a buffer ID. This is helpful later, when building freelist. I do think this is a big advantage as compared to the OS interleaving. > I wonder if we should *increase* the size of shared_buffers whenever huge > pages are in use and there's padding space due to the huge page > boundaries. Pretty pointless to waste that memory if we can instead use if for > the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge > pages... 
I think that makes sense, except maybe for operations that need to scan the whole buffer pool (i.e related to BUF_DROP_FULL_SCAN_THRESHOLD)? > > 5) v1-0005-NUMA-interleave-PGPROC-entries.patch > > > > Another area that seems like it might benefit from NUMA is PGPROC, so I > > gave it a try. It turned out somewhat challenging. Similarly to buffers > > we have two pieces that need to be located in a coordinated way - PGPROC > > entries and fast-path arrays. But we can't use the same approach as for > > buffers/descriptors, because > > > > (a) Neither of those pieces aligns with memory page size (PGPROC is > > ~900B, fast-path arrays are variable length). > > > (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require > > rather high max_connections before we use multiple huge pages. > > Right now sizeof(PGPROC) happens to be multiple of 64 (i.e. the most common > cache line size) Oh right, it's currently 832 bytes and the patch extends that to 840 bytes. With a bit of reordering: diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h index 5cb1632718e..2ed2f94202a 100644 --- a/src/include/storage/proc.h +++ b/src/include/storage/proc.h @@ -194,8 +194,6 @@ struct PGPROC * vacuum must not remove tuples deleted by * xid >= xmin ! */ - int procnumber; /* index in ProcGlobal->allProcs */ - int pid; /* Backend's process ID; 0 if prepared xact */ int pgxactoff; /* offset into various ProcGlobal->arrays with @@ -243,6 +241,7 @@ struct PGPROC /* Support for condition variables. */ proclist_node cvWaitLink; /* position in CV wait list */ + int procnumber; /* index in ProcGlobal->allProcs */ /* Info about lock the process is currently waiting for, if any. */ /* waitLock and waitProcLock are NULL if not currently waiting. 
*/ @@ -268,6 +267,7 @@ struct PGPROC */ XLogRecPtr waitLSN; /* waiting for this LSN or higher */ int syncRepState; /* wait state for sync rep */ + int numa_node; dlist_node syncRepLinks; /* list link if process is in syncrep queue */ /* @@ -321,9 +321,6 @@ struct PGPROC PGPROC *lockGroupLeader; /* lock group leader, if I'm a member */ dlist_head lockGroupMembers; /* list of members, if I'm a leader */ dlist_node lockGroupLink; /* my member link, if I'm a member */ - - /* NUMA node */ - int numa_node; }; That could be back to 832 (the order does not make sense logically speaking though). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 7/9/25 08:40, Cédric Villemain wrote: >> On 7/8/25 18:06, Cédric Villemain wrote: >>> >>> >>> >>> >>> >>> >>>> On 7/8/25 03:55, Cédric Villemain wrote: >>>>> Hi Andres, >>>>> >>>>>> Hi, >>>>>> >>>>>> On 2025-07-05 07:09:00 +0000, Cédric Villemain wrote: >>>>>>> In my work on more careful PostgreSQL resource management, I've come >>>>>>> to the >>>>>>> conclusion that we should avoid pushing policy too deeply into the >>>>>>> PostgreSQL core itself. Therefore, I'm quite skeptical about >>>>>>> integrating >>>>>>> NUMA-specific management directly into core PostgreSQL in such a >>>>>>> way. >>>>>> >>>>>> I think it's actually the opposite - whenever we pushed stuff like >>>>>> this >>>>>> outside of core it has hurt postgres substantially. Not having >>>>>> replication in >>>>>> core was a huge mistake. Not having HA management in core is >>>>>> probably the >>>>>> biggest current adoption hurdle for postgres. >>>>>> >>>>>> To deal better with NUMA we need to improve memory placement and >>>>>> various >>>>>> algorithms, in an interrelated way - that's pretty much impossible >>>>>> to do >>>>>> outside of core. >>>>> >>>>> Except the backend pinning which is easy to achieve, thus my >>>>> comment on >>>>> the related patch. >>>>> I'm not claiming NUMA memory and all should be managed outside of core >>>>> (though I didn't read other patches yet). >>>>> >>>> >>>> But an "optimal backend placement" seems to very much depend on >>>> where we >>>> placed the various pieces of shared memory. Which the external module >>>> will have trouble following, I suspect. >>>> >>>> I still don't have any idea what exactly would the external module do, >>>> how would it decide where to place the backend. Can you describe some >>>> use case with an example? >>>> >>>> Assuming we want to actually pin tasks from within Postgres, what I >>>> think might work is allowing modules to "advise" on where to place the >>>> task. But the decision would still be done by core. 
>>> >>> Possibly exactly what you're doing in proc.c when managing allocation of >>> processes, but not hardcoded in postgresql (patches 02, 05 and 06 are good >>> candidates), I didn't get that they require information not available to >>> any process executing code from a module. >>> >> >> Well, it needs to understand how some other stuff (especially PGPROC >> entries) is distributed between nodes. I'm not sure how much of this >> internal information we want to expose outside core ... >> >>> Parts of your code where you assign/define policy could be in one or >>> more relevant routines of a "numa profile manager", like in an >>> initProcessRoutine(), and registered in pmroutine struct: >>> >>> pmroutine = GetPmRoutineForInitProcess(); >>> if (pmroutine != NULL && >>> pmroutine->init_process != NULL) >>> pmroutine->init_process(MyProc); >>> >>> This way it's easier to manage alternative policies, and also to be able >>> to adjust when hardware and linux kernel changes. >>> >> >> I'm not against making this extensible, in some way. But I still >> struggle to imagine a reasonable alternative policy, where the external >> module gets the same information and ends up with a different decision. >> >> So what would the alternate policy look like? What use case would the >> module be supporting? > > > That's the whole point: there are very distinct usages of PostgreSQL in > the field. And maybe not all of them will require the policy defined by > PostgreSQL core. > > May I ask the reverse: what prevents external modules from taking those > decisions ? There are already a lot of areas where external code can take > over PostgreSQL processing, like Neon is doing. > The complexity of making everything extensible in an arbitrary way. To make it extensible in a useful way, we need to have a reasonably clear idea what aspects need to be extensible, and what's the goal. 
> There is some very early processing for memory setup that I can see as > a current blocker, and here I'd prefer a more compliant NUMA api as > proposed by Jakub so it's possible to arrange based on workload, > hardware configuration or other matters. Reworking to get distinct > segments and all as you do is great, and a combo of both approaches is probably > of great interest. There is also this weighted interleave discussed and > probably much more to come in this area in Linux. > > I think some points were raised already about possible distinct policies, I > am precisely claiming that it is hard to come up with one good policy with > limited setup options, thus the requirement to keep that flexible enough > (hooks, api, 100 GUCs?). > I'm sorry, I don't want to sound too negative, but "I want arbitrary extensibility" is not very useful feedback. I've asked you to give some examples of policies that'd customize some of the NUMA stuff. > There is an EPYC story here also, given the NUMA setup can vary > depending on BIOS setup, associated NUMA policy must probably take that > into account (L3 can be either real cache or 4 extra "local" NUMA nodes > - with highly distinct access cost from a RAM module). > Does that change how PostgreSQL will place memory and processes? Is it > important or of interest? > So how exactly would the policy handle this? Right now we're entirely oblivious to L3, or on-CPU caches in general. We don't even consider the size of L3 when sizing hash tables in a hashjoin etc. regards -- Tomas Vondra
On 7/9/25 19:23, Andres Freund wrote: > Hi, > > On 2025-07-09 12:55:51 -0400, Greg Burd wrote: >> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote: >> >>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist >>> entirely. While clocksweep is perhaps minutely slower in a single >>> thread than >>> the freelist, clock sweep scales *considerably* better [1]. As it's rather >>> rare to be bottlenecked on clock sweep speed for a single thread >>> (rather then >>> IO or memory copy overhead), I think it's worth favoring clock sweep. >> >> Hey Andres, thanks for spending time on this. I've worked before on >> freelist implementations (last one in LMDB) and I think you're onto >> something. I think it's an innovative idea and that the speed >> difference will either be lost in the noise or potentially entirely >> mitigated by avoiding duplicate work. > > Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE > perform better because it doesn't need to maintain the freelist anymore... > > >>> Also needing to switch between getting buffers from the freelist and >>> the sweep >>> makes the code more expensive. I think just having the buffer in the sweep, >>> with a refcount / usagecount of zero would suffice. >> >> If you're not already coding this, I'll jump in. :) > > My experimental patch is literally a four character addition ;), namely adding > "0 &&" to the relevant code in StrategyGetBuffer(). > > Obviously a real patch would need to do some more work than that. Feel free > to take on that project, I am not planning on tackling that in near term. > > > There's other things around this that could use some attention. It's not hard > to see clock sweep be a bottleneck in concurrent workloads - partially due to > the shared maintenance of the clock hand. A NUMAed clock sweep would address > that. 
> However, we also maintain StrategyControl->numBufferAllocs, which is a > significant contention point and would not necessarily be removed by a > NUMAificiation of the clock sweep. > Wouldn't it make sense to partition the numBufferAllocs too, though? I don't remember if my hacky experimental patch NUMA-partitioning did that or I just thought about doing that, but why wouldn't that be enough? Places that need the "total" count would have to sum the counters, but it seemed to me most of the places would be fine with the "local" count for that partition. If we also make sure to "sync" the clocksweeps so as to not work on just a single partition, that might be enough ... regards -- Tomas Vondra
Hi, On 2025-07-10 17:31:45 +0200, Tomas Vondra wrote: > On 7/9/25 19:23, Andres Freund wrote: > > There's other things around this that could use some attention. It's not hard > > to see clock sweep be a bottleneck in concurrent workloads - partially due to > > the shared maintenance of the clock hand. A NUMAed clock sweep would address > > that. However, we also maintain StrategyControl->numBufferAllocs, which is a > > significant contention point and would not necessarily be removed by a > > NUMAificiation of the clock sweep. > > > > Wouldn't it make sense to partition the numBufferAllocs too, though? I > don't remember if my hacky experimental patch NUMA-partitioning did that > or I just thought about doing that, but why wouldn't that be enough? It could be solved together with partitioning, yes - that's what I was trying to reference with the emphasized bit in "would *not necessarily* be removed by a NUMAificiation of the clock sweep". Greetings, Andres Freund
Hi, On 2025-07-10 14:17:21 +0000, Bertrand Drouvot wrote: > On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote: > > I wonder if we should *increase* the size of shared_buffers whenever huge > > pages are in use and there's padding space due to the huge page > > boundaries. Pretty pointless to waste that memory if we can instead use if for > > the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge > > pages... > > I think that makes sense, except maybe for operations that need to scan > the whole buffer pool (i.e related to BUF_DROP_FULL_SCAN_THRESHOLD)? I don't think the increases here are big enough for that to matter, unless perhaps you're using 1GB huge pages. But if you're concerned about dropping tables very fast (i.e. you're running schema change heavy regression tests), you're not going to use 1GB huge pages. > > > 5) v1-0005-NUMA-interleave-PGPROC-entries.patch > > > > > > Another area that seems like it might benefit from NUMA is PGPROC, so I > > > gave it a try. It turned out somewhat challenging. Similarly to buffers > > > we have two pieces that need to be located in a coordinated way - PGPROC > > > entries and fast-path arrays. But we can't use the same approach as for > > > buffers/descriptors, because > > > > > > (a) Neither of those pieces aligns with memory page size (PGPROC is > > > ~900B, fast-path arrays are variable length). > > > > > (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require > > > rather high max_connections before we use multiple huge pages. > > > > Right now sizeof(PGPROC) happens to be multiple of 64 (i.e. the most common > > cache line size) > > Oh right, it's currently 832 bytes and the patch extends that to 840 bytes. I don't think the patch itself is the problem - it really is just happenstance that it's a multiple of the line size right now. And it's not on common Armv8 platforms... 
> With a bit of reordering: > > That could be back to 832 (the order does not make sense logically speaking > though). I don't think shrinking the size in a one-off way just to keep the "accidental" size-is-multiple-of-64 property is promising. It'll just get broken again. I think we should: a) pad the size of PGPROC to a cache line (or even to a subsequent power of 2, to make array access cheaper, right now that involves actual multiplications rather than shifts or indexed `lea` instructions). That's probably just a pg_attribute_aligned b) Reorder PGPROC to separate frequently modified from almost-read-only data, to increase cache hit ratio. Greetings, Andres Freund
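Option (a) can be sketched with a toy struct. With GCC/Clang, an alignment attribute on the type pads sizeof() to a multiple of that alignment, so array elements never straddle cache lines. "FakeProc" is an invented miniature, not the real PGPROC layout:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64

/* Frequently modified fields grouped first, almost-read-only fields
 * after (option b); the aligned attribute pads the whole struct to a
 * cache-line multiple (option a). */
typedef struct FakeProc
{
	/* hot: modified frequently */
	int			lwWaiting;
	int			lwWaitMode;
	/* cold: effectively read-only after startup */
	int			procnumber;
	int			numa_node;
} __attribute__((aligned(CACHE_LINE_SIZE))) FakeProc;

static FakeProc procs[4];
```

Here sizeof(FakeProc) also happens to be a power of two, so computing &procs[i] compiles to a shift rather than a multiply - the second half of point (a).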
> On Jul 10, 2025, at 8:13 AM, Burd, Greg <greg@burd.me> wrote: > > >> On Jul 9, 2025, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote: >> >> Hi, >> >> On 2025-07-09 12:55:51 -0400, Greg Burd wrote: >>> On Jul 9 2025, at 12:35 pm, Andres Freund <andres@anarazel.de> wrote: >>> >>>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist >>>> entirely. While clocksweep is perhaps minutely slower in a single >>>> thread than >>>> the freelist, clock sweep scales *considerably* better [1]. As it's rather >>>> rare to be bottlenecked on clock sweep speed for a single thread >>>> (rather then >>>> IO or memory copy overhead), I think it's worth favoring clock sweep. >>> >>> Hey Andres, thanks for spending time on this. I've worked before on >>> freelist implementations (last one in LMDB) and I think you're onto >>> something. I think it's an innovative idea and that the speed >>> difference will either be lost in the noise or potentially entirely >>> mitigated by avoiding duplicate work. >> >> Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE >> perform better because it doesn't need to maintain the freelist anymore... >> >> >>>> Also needing to switch between getting buffers from the freelist and >>>> the sweep >>>> makes the code more expensive. I think just having the buffer in the sweep, >>>> with a refcount / usagecount of zero would suffice. >>> >>> If you're not already coding this, I'll jump in. :) >> >> My experimental patch is literally a four character addition ;), namely adding >> "0 &&" to the relevant code in StrategyGetBuffer(). >> >> Obviously a real patch would need to do some more work than that. Feel free >> to take on that project, I am not planning on tackling that in near term. >> > > I started on this last night, making good progress. Thanks for the inspiration. I'll create a new thread to track the work and cross-reference when I have something reasonable to show (hopefully later today). 
> >> There's other things around this that could use some attention. It's not hard >> to see clock sweep be a bottleneck in concurrent workloads - partially due to >> the shared maintenance of the clock hand. A NUMAed clock sweep would address >> that. > > Working on it. For archival sake, and to tie up loose ends I'll link from here to a new thread I just started that proposes the removal of the freelist and the buffer_strategy_lock [1]. That patch set doesn't address any NUMA-related tasks directly, but it should remove some pain when working in that direction by removing code that requires partitioning and locking and... best. -greg [1] https://postgr.es/m/E2D6FCDC-BE98-4F95-B45E-699C3E17BA10@burd.me
Hi, Here's a v2 of the patch series, with a couple changes: * I simplified the various freelist partitioning by keeping only the "node" partitioning (so the cpu/pid strategies are gone). Those were meant for experimenting, but it made the code more complicated so I ditched it. * I changed the freelist partitioning scheme a little bit, based on the discussion in this thread. Instead of having a single "partition" per NUMA node, there's not a minimum number of partitions (set to 4). So even if your system is not NUMA, you'll have 4 of them. If you have 2 nodes, you'll still have 4, and each node will get 2. With 3 nodes we get 6 partitions (we need 2 per node, and we want to keep the number equal to keep things simple). Once the number of nodes exceeds 4, the heuristics switches to one partition per node. I'm aware there's a discussion about maybe simply removing freelists entirely. If that happens, this becomes mostly irrelevant, of course. The code should also make sure the freelists "agree" with how the earlier patch mapped the buffers to NUMA nodes, i.e. the freelist should only contain buffers from the "correct" NUMA node, etc. I haven't paid much attention to this - I believe it should work for "nice" values of shared buffers (when it evenly divides between nodes). But I'm sure it's possible to confuse that (won't cause crashes, but inefficiency). * There's now a patch partitioning clocksweep, using the same scheme as the freelists. I came to the conclusion it doesn't make much sense to partition these things differently - I can't think of a reason why that would be advantageous, and it makes it easier to reason about. The clocksweep partitioning is somewhat harder, because it affects BgBufferSync() and related code. With the partitioning we now have multiple "clock hands" for different ranges of buffers, and the clock sweep needs to consider that. I modified BgBufferSync to simply loop through the ClockSweep partitions, and do a small cleanup for each. 
It does work, as in "it doesn't crash". But this part definitely needs review to make sure I got the changes to the "pacing" right. * This new freelist/clocksweep partitioning scheme is however harder to disable. I now realize the GUC may not quite do the trick, and there isn't even a GUC for the clocksweep. I need to think about this, but I'm not even sure how feasible it'd be to have two separate GUCs (because of how these two pieces are intertwined). For now if you want to test without the partitioning, you need to skip the patch. I did some quick perf testing on my old xeon machine (2 NUMA nodes), and the results are encouraging. For a read-only pgbench (2x shared buffers, within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not bad considering the patch is more about consistency than raw throughput. For a read-write pgbench I however saw some strange drops/increases of throughput. I suspect this might be due to some thinko in the clocksweep partitioning, but I'll need to take a closer look. regards -- Tomas Vondra
Attachments
- v2-0007-NUMA-pin-backends-to-NUMA-nodes.patch
- v2-0006-NUMA-interleave-PGPROC-entries.patch
- v2-0005-NUMA-clockweep-partitioning.patch
- v2-0004-NUMA-partition-buffer-freelist.patch
- v2-0003-freelist-Don-t-track-tail-of-a-freelist.patch
- v2-0002-NUMA-localalloc.patch
- v2-0001-NUMA-interleaving-buffers.patch
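The partition-count heuristic described in the message above (a minimum of 4 partitions, the same number on every node, and one per node once there are that many nodes) can be written down compactly. The function name is invented and v2 may compute this differently:

```c
#include <assert.h>

#define MIN_FREELIST_PARTITIONS 4

/* Number of freelist/clocksweep partitions for a given node count:
 * every node gets the same number of partitions, at least
 * MIN_FREELIST_PARTITIONS in total, and exactly one per node once
 * there are at least that many nodes. */
static int
num_partitions(int numa_nodes)
{
	int			per_node;

	if (numa_nodes < 1)
		numa_nodes = 1;			/* treat a non-NUMA system as one node */

	if (numa_nodes >= MIN_FREELIST_PARTITIONS)
		return numa_nodes;		/* one partition per node */

	/* smallest equal per-node count that reaches the minimum */
	per_node = (MIN_FREELIST_PARTITIONS + numa_nodes - 1) / numa_nodes;
	return per_node * numa_nodes;
}
```

This reproduces the examples in the message: 1 node gives 4 partitions, 2 nodes give 4 (2 per node), 3 nodes give 6 (2 per node), and beyond that each node gets exactly one.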
On 7/4/25 20:12, Tomas Vondra wrote: > On 7/4/25 13:05, Jakub Wartak wrote: >> ... >> >> 8. v1-0005 2x + /* if (numa_procs_interleave) */ >> >> Ha! it's a TRAP! I've uncommented it because I wanted to try it out >> without it (just by setting GUC off) , but "MyProc->sema" is NULL : >> >> 2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL >> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit >> [..] >> 2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755) >> was terminated by signal 11: Segmentation fault >> 2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other >> active server processes >> 2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because >> "restart_after_crash" is off >> 2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down >> >> [New LWP 28755] >> [Thread debugging using libthread_db enabled] >> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". >> Core was generated by `postgres: io worker '. >> Program terminated with signal SIGSEGV, Segmentation fault. >> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) >> at ./nptl/sem_waitcommon.c:136 >> 136 ./nptl/sem_waitcommon.c: No such file or directory. 
>> (gdb) where >> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) >> at ./nptl/sem_waitcommon.c:136 >> #1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81 >> #2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at >> ../src/backend/port/posix_sema.c:302 >> #3 0x0000556191970553 in InitAuxiliaryProcess () at >> ../src/backend/storage/lmgr/proc.c:992 >> #4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at >> ../src/backend/postmaster/auxprocess.c:65 >> #5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized >> out>, startup_data_len=<optimized out>) at >> ../src/backend/storage/aio/method_worker.c:393 >> #6 0x00005561918e8163 in postmaster_child_launch >> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086, >> startup_data=startup_data@entry=0x0, >> startup_data_len=startup_data_len@entry=0, >> client_sock=client_sock@entry=0x0) at >> ../src/backend/postmaster/launch_backend.c:290 >> #7 0x00005561918ea09a in StartChildProcess >> (type=type@entry=B_IO_WORKER) at >> ../src/backend/postmaster/postmaster.c:3973 >> #8 0x00005561918ea308 in maybe_adjust_io_workers () at >> ../src/backend/postmaster/postmaster.c:4404 >> [..] >> (gdb) print *MyProc->sem >> Cannot access memory at address 0x0 >> > > Yeah, good catch. I'll look into that next week. > I've been unable to reproduce this issue, but I'm not sure what settings you actually used for this instance. Can you give me more details how to reproduce this? regards -- Tomas Vondra
Hi, On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote: > Here's a v2 of the patch series, with a couple changes: Not a deep look at the code, just a quick reply. > * I changed the freelist partitioning scheme a little bit, based on the > discussion in this thread. Instead of having a single "partition" per > NUMA node, there's not a minimum number of partitions (set to 4). So I assume s/not/now/? > * There's now a patch partitioning clocksweep, using the same scheme as > the freelists. Nice! > I came to the conclusion it doesn't make much sense to partition these > things differently - I can't think of a reason why that would be > advantageous, and it makes it easier to reason about. Agreed. > The clocksweep partitioning is somewhat harder, because it affects > BgBufferSync() and related code. With the partitioning we now have > multiple "clock hands" for different ranges of buffers, and the clock > sweep needs to consider that. I modified BgBufferSync to simply loop > through the ClockSweep partitions, and do a small cleanup for each. That probably makes sense for now. It might need a bit of a larger adjustment at some point, but ... > * This new freelist/clocksweep partitioning scheme is however harder to > disable. I now realize the GUC may quite do the trick, and there even is > not a GUC for the clocksweep. I need to think about this, but I'm not > even how feasible it'd be to have two separate GUCs (because of how > these two pieces are intertwined). For now if you want to test without > the partitioning, you need to skip the patch. I think it's totally fair to enable/disable them at the same time. They're so closely related, that I don't think it really makes sense to measure them separately. > I did some quick perf testing on my old xeon machine (2 NUMA nodes), and > the results are encouraging. For a read-only pgbench (2x shared buffers, > within RAM), I saw an increase from 1.1M tps to 1.3M. 
Not crazy, but not > bad considering the patch is more about consistency than raw throughput. Personally I think a 1.18x improvement on a relatively small NUMA machine is really rather awesome. > For a read-write pgbench I however saw some strange drops/increases of > throughput. I suspect this might be due to some thinko in the clocksweep > partitioning, but I'll need to take a closer look. Was that with pinning etc enabled or not? > From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001 > From: Tomas Vondra <tomas@vondra.me> > Date: Sun, 8 Jun 2025 18:53:12 +0200 > Subject: [PATCH v2 5/7] NUMA: clockweep partitioning > > @@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r > /* > * Nothing on the freelist, so run the "clock sweep" algorithm > * > - * XXX Should we also make this NUMA-aware, to only access buffers from > - * the same NUMA node? That'd probably mean we need to make the clock > - * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a > - * subset of buffers. But that also means each process could "sweep" only > - * a fraction of buffers, even if the other buffers are better candidates > - * for eviction. Would that also mean we'd have multiple bgwriters, one > - * for each node, or would one bgwriter handle all of that? > + * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at > + * buffers from a single partition, aligned with the NUMA node. That > + * means it only accesses buffers from the same NUMA node. > + * > + * XXX That also means each process "sweeps" only a fraction of buffers, > + * even if the other buffers are better candidates for eviction. Maybe > + * there should be some logic to "steal" buffers from other freelists > + * or other nodes? I think we *definitely* need "stealing" from other clock sweeps, whenever there's a meaningful imbalance between the different sweeps.
I don't think we need to be overly precise about it, a small imbalance won't have that much of an effect. But clearly it doesn't make sense to say that one backend can only fill buffers in the current partition, that'd lead to massive performance issues in a lot of workloads. The hardest thing probably is to make the logic for when to check foreign clock sweeps cheap enough. One way would be to do it whenever a sweep wraps around, that'd probably amortize the cost sufficiently, and I don't think it'd be too imprecise, as we'd have processed that set of buffers in a row without partitioning as well. But it'd probably be too coarse when determining for how long to use a foreign sweep instance. But we probably could address that by rechecking the balance more frequently when using a foreign partition. Another way would be to have bgwriter manage this. Whenever it detects that one ring is too far ahead, it could set an "avoid this partition" bit, which would trigger backends that natively use that partition to switch to foreign partitions that don't currently have that bit set. I suspect there's a problem with that approach though, I worry that the amount of time that bgwriter spends in BgBufferSync() may sometimes be too long, leading to too much imbalance. Greetings, Andres Freund
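The "check on wrap-around" idea above could be sketched roughly like this (a minimal illustration with invented names - SweepPartition, maybe_check_balance - not code from the patches): the cross-partition balance check only runs once per full pass over the partition, so its cost is amortized over nbuffers ticks.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct SweepPartition
{
    atomic_ulong hand;          /* monotonically increasing tick counter */
    int          first_buffer;  /* first buffer ID in this partition */
    int          nbuffers;      /* number of buffers in this partition */
} SweepPartition;

static int balance_checks = 0;

/* Placeholder: here we'd compare per-partition allocation counts and
 * decide whether to switch to a "foreign" partition for a while. */
static void
maybe_check_balance(void)
{
    balance_checks++;
}

/*
 * Return the next victim buffer in this partition. The (potentially
 * expensive) cross-partition balance check is piggy-backed on the
 * wrap-around, so it runs once per nbuffers ticks.
 */
static int
clock_sweep_tick(SweepPartition *p)
{
    unsigned long tick = atomic_fetch_add(&p->hand, 1);
    int pos = (int) (tick % (unsigned long) p->nbuffers);

    if (tick > 0 && pos == 0)
        maybe_check_balance();

    return p->first_buffer + pos;
}
```

As the thread notes, triggering the check only on wrap-around is cheap but coarse; deciding for how long to keep using a foreign partition needs more frequent rechecking.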
On 7/18/25 18:46, Andres Freund wrote: > Hi, > > On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote: >> Here's a v2 of the patch series, with a couple changes: > > Not a deep look at the code, just a quick reply. > > >> * I changed the freelist partitioning scheme a little bit, based on the >> discussion in this thread. Instead of having a single "partition" per >> NUMA node, there's not a minimum number of partitions (set to 4). So > > I assume s/not/now/? > Yes. > >> * There's now a patch partitioning clocksweep, using the same scheme as >> the freelists. > > Nice! > > >> I came to the conclusion it doesn't make much sense to partition these >> things differently - I can't think of a reason why that would be >> advantageous, and it makes it easier to reason about. > > Agreed. > > >> The clocksweep partitioning is somewhat harder, because it affects >> BgBufferSync() and related code. With the partitioning we now have >> multiple "clock hands" for different ranges of buffers, and the clock >> sweep needs to consider that. I modified BgBufferSync to simply loop >> through the ClockSweep partitions, and do a small cleanup for each. > > That probably makes sense for now. It might need a bit of a larger adjustment at some point, but ... > I couldn't think of something fundamentally better and not too complex. I suspect we might want to use multiple bgwriters in the future, and this scheme seems to be reasonably well suited for that too. I'm also thinking about having some sort of "unified" partitioning scheme for all the places partitioning shared buffers. Right now each of the places does it on its own, i.e. buff_init, freelist and clocksweep all have their code splitting NBuffers into partitions. And it should align, because what would be the benefit if it didn't? But having three variants of the same code seems a bit pointless.
I think buff_init should build a common definition of buffer partitions, and the remaining parts should use that as the source of truth ... > >> * This new freelist/clocksweep partitioning scheme is however harder to >> disable. I now realize the GUC may quite do the trick, and there even is >> not a GUC for the clocksweep. I need to think about this, but I'm not >> even how feasible it'd be to have two separate GUCs (because of how >> these two pieces are intertwined). For now if you want to test without >> the partitioning, you need to skip the patch. > > I think it's totally fair to enable/disable them at the same time. They're so > closely related, that I don't think it really makes sense to measure them > separately. > Yeah, that's a fair point. > >> I did some quick perf testing on my old xeon machine (2 NUMA nodes), and >> the results are encouraging. For a read-only pgbench (2x shared buffers, >> within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not >> bad considering the patch is more about consistency than raw throughput. > > Personally I think an 1.18x improvement on a relatively small NUMA machine is > really rather awesome. > True, but I want to stress it's just one quick (& simple) test. Much more testing is needed before I can make reliable claims. > >> For a read-write pgbench I however saw some strange drops/increases of >> throughput. I suspect this might be due to some thinko in the clocksweep >> partitioning, but I'll need to take a closer look. > > Was that with pinning etc enabled or not? > IIRC it was with everything enabled, except for numa_procs_pin (which pins backends to NUMA nodes). I found that to actually harm performance in some of the tests (even just read-only ones), resulting in uneven usage of cores and lower throughput.
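The "common definition of buffer partitions" idea above might look something like this minimal sketch (the struct and field names are invented for illustration, not taken from the patches): buff_init fills in one small shared array, and the freelist/clocksweep code looks partitions up there instead of re-deriving the split from NBuffers.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * One entry per buffer partition. The idea: buff_init fills this in
 * once, and freelist/clocksweep code reads it as the source of truth
 * instead of splitting NBuffers on its own.
 */
typedef struct BufferPartition
{
    int numa_node;     /* NUMA node the chunk is mapped to */
    int first_buffer;  /* first buffer ID in this partition */
    int num_buffers;   /* number of buffers in this partition */
} BufferPartition;

typedef struct BufferPartitionRegistry
{
    int             npartitions;
    BufferPartition partitions[];   /* would live in shared memory */
} BufferPartitionRegistry;

/* Find which partition a buffer belongs to (-1 if out of range). */
static int
buffer_partition_index(const BufferPartitionRegistry *reg, int buf_id)
{
    for (int i = 0; i < reg->npartitions; i++)
    {
        const BufferPartition *p = &reg->partitions[i];

        if (buf_id >= p->first_buffer &&
            buf_id < p->first_buffer + p->num_buffers)
            return i;
    }
    return -1;
}
```

With the fixed scheme described earlier in the thread (equal chunks per node), the linear scan could of course be replaced by plain arithmetic on the buffer ID.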
> > >> From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001 >> From: Tomas Vondra <tomas@vondra.me> >> Date: Sun, 8 Jun 2025 18:53:12 +0200 >> Subject: [PATCH v2 5/7] NUMA: clockweep partitioning >> > > >> @@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r >> /* >> * Nothing on the freelist, so run the "clock sweep" algorithm >> * >> - * XXX Should we also make this NUMA-aware, to only access buffers from >> - * the same NUMA node? That'd probably mean we need to make the clock >> - * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a >> - * subset of buffers. But that also means each process could "sweep" only >> - * a fraction of buffers, even if the other buffers are better candidates >> - * for eviction. Would that also mean we'd have multiple bgwriters, one >> - * for each node, or would one bgwriter handle all of that? >> + * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at >> + * buffers from a single partition, aligned with the NUMA node. That >> + * means it only accesses buffers from the same NUMA node. >> + * >> + * XXX That also means each process "sweeps" only a fraction of buffers, >> + * even if the other buffers are better candidates for eviction. Maybe >> + * there should be some logic to "steal" buffers from other freelists >> + * or other nodes? > > I think we *definitely* need "stealing" from other clock sweeps, whenever > there's a meaningful imbalance between the different sweeps. > > I don't think we need to be overly precise about it, a small imbalance won't > have that much of an effect. But clearly it doesn't make sense to say that one > backend can only fill buffers in the current partition, that'd lead to massive > performance issues in a lot of workloads. > Agreed. > The hardest thing probably is to make the logic for when to check foreign > clock sweeps cheap enough. 
> > One way would be to do it whenever a sweep wraps around, that'd probably > amortize the cost sufficiently, and I don't think it'd be too imprecise, as > we'd have processed that set of buffers in a row without partitioning as > well. But it'd probably be too coarse when determining for how long to use a > foreign sweep instance. But we probably could address that by rechecking the > balanace more frequently when using a foreign partition. > What do you mean by "it"? What would happen after a sweep wraps around? > Another way would be to have bgwriter manage this. Whenever it detects that > one ring is too far ahead, it could set a "avoid this partition" bit, which > would trigger backends that natively use that partition to switch to foreign > partitions that don't currently have that bit set. I suspect there's a > problem with that approach though, I worry that the amount of time that > bgwriter spends in BgBufferSync() may sometimes be too long, leading to too > much imbalance. > I'm afraid having hard "avoid" flags would lead to sudden and unexpected changes in performance as we enable/disable partitions. I think a good solution should "smooth it out" somehow, e.g. by not having a true/false flag, but having some sort of "preference" factor with values between (0.0, 1.0) which says how much we should use that partition. I was imagining something like this: Say we know the number of buffers allocated for each partition (in the last round), and we (or rather the BgBufferSync) calculate: coefficient = 1.0 - (nallocated_partition / nallocated) and then use that to "correct" which partition to allocate buffers from. Or maybe just watch how far from the "fair share" we were in the last interval, and gradually increase/decrease the "partition preference" which would say how often we need to "steal" from other partitions. E.g. we find nallocated_partition is 2x the fair share, i.e.
nallocated_partition / (nallocated / nparts) = 2.0 Then we say 25% of the time look at some other partition, to "cut" the imbalance in half. And then repeat that in the next cycle, etc. So a process would look at its "home partition" by default, but it'd "roll a dice" first and if above the calculated probability it'd pick some other partition instead (this would need to be done so that it gets balanced overall). If the bgwriter interval is too long, maybe the recalculation could be triggered regularly after any of the clocksweeps wraps around, or after some number of allocations, or something like that. regards -- Tomas Vondra
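The "25% of the time" arithmetic above can be captured in a tiny helper (an illustrative sketch only; steal_probability is an invented name). The probability is chosen so that, if home backends go elsewhere with probability p, the partition's excess over the fair share is cut in half in the next interval: solving (1 - p) * r = (r + 1) / 2 gives p = (r - 1) / (2 * r), which is 0.25 for r = 2.

```c
#include <assert.h>

/*
 * Probability that a backend whose "home" partition allocated
 * nalloc_part buffers (of nalloc total, across nparts partitions)
 * picks some other partition instead, chosen so the excess over
 * the fair share is cut in half over the next interval:
 *
 *   r = nalloc_part / (nalloc / nparts)
 *   p = (r - 1) / (2 * r)      for r > 1, else 0
 *
 * For r = 2 (partition at 2x the fair share) this yields 0.25,
 * i.e. "25% of the time look at some other partition".
 */
static double
steal_probability(double nalloc_part, double nalloc, int nparts)
{
    double fair = nalloc / nparts;
    double r = nalloc_part / fair;

    if (r <= 1.0)
        return 0.0;             /* at or below fair share: stay home */
    return (r - 1.0) / (2.0 * r);
}
```

Each backend would "roll a dice" against this value before every allocation and, if it loses, pick a partition below its fair share (so the stealing itself stays balanced).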
Hi, On 2025-07-18 22:48:00 +0200, Tomas Vondra wrote: > On 7/18/25 18:46, Andres Freund wrote: > >> For a read-write pgbench I however saw some strange drops/increases of > >> throughput. I suspect this might be due to some thinko in the clocksweep > >> partitioning, but I'll need to take a closer look. > > > > Was that with pinning etc enabled or not? > > > > IIRC it was with everything enabled, except for numa_procs_pin (which > pins backend to NUMA node). I found that to actually harm performance in > some of the tests (even just read-only ones), resulting in uneven usage > of cores and lower throughput. FWIW, I really doubt that something like numa_procs_pin is viable outside of very narrow niches until we have a *lot* more infrastructure in place. Like PG would need to be threaded, we'd need a separation between thread and connection and an executor that'd allow us to switch from working on one query to working on another query. > > The hardest thing probably is to make the logic for when to check foreign > > clock sweeps cheap enough. > > > > One way would be to do it whenever a sweep wraps around, that'd probably > > amortize the cost sufficiently, and I don't think it'd be too imprecise, as > > we'd have processed that set of buffers in a row without partitioning as > > well. But it'd probably be too coarse when determining for how long to use a > > foreign sweep instance. But we probably could address that by rechecking the > > balanace more frequently when using a foreign partition. > > > > What you mean by "it"? it := Considering switching back from using a "foreign" clock sweep instance whenever the sweep wraps around. > What would happen after a sweep wraps around? 
The scenario I'm worried about is this: 1) a bunch of backends read buffers on numa node A, using the local clock sweep instance 2) due to all of that activity, the clock sweep advances much faster than the clock sweep for numa node B 3) the clock sweep on A wraps around, we discover the imbalance, and all the backend switch to scanning on numa node B, moving that clock sweep ahead much more aggressively 4) clock sweep on B wraps around, there's imbalance the other way round now, so they all switch back to A > > Another way would be to have bgwriter manage this. Whenever it detects that > > one ring is too far ahead, it could set a "avoid this partition" bit, which > > would trigger backends that natively use that partition to switch to foreign > > partitions that don't currently have that bit set. I suspect there's a > > problem with that approach though, I worry that the amount of time that > > bgwriter spends in BgBufferSync() may sometimes be too long, leading to too > > much imbalance. > > > > I'm afraid having hard "avoid" flags would lead to sudden and unexpected > changes in performance as we enable/disable partitions. I think a good > solution should "smooth it out" somehow, e.g. by not having a true/false > flag, but having some sort of "preference" factor with values between > (0.0, 1.0) which says how much we should use that partition. Yea, I think that's a fair worry. > I was imagining something like this: > > Say we know the number of buffers allocated for each partition (in the > last round), and we (or rather the BgBufferSync) calculate: > > coefficient = 1.0 - (nallocated_partition / nallocated) > > and then use that to "correct" which partition to allocate buffers from. > Or maybe just watch how far from the "fair share" we were in the last > interval, and gradually increase/decrease the "partition preference" > which would say how often we need to "steal" from other partitions. > > E.g. we find nallocated_partition is 2x the fair share, i.e. 
> > nallocated_partition / (nallocated / nparts) = 2.0 > > Then we say 25% of the time look at some other partition, to "cut" the > imbalance in half. And then repeat that in the next cycle, etc. > > So a process would look at it's "home partition" by default, but it's > "roll a dice" first and if above the calculated probability it'd pick > some other partition instead (this would need to be done so that it gets > balanced overall). That does sound reasonable. > If the bgwriter interval is too long, maybe the recalculation could be > triggered regularly after any of the clocksweeps wraps around, or after > some number of allocations, or something like that. I'm pretty sure the bgwriter might not run often enough, or predictably frequently enough, for that. Greetings, Andres Freund
On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 7/4/25 20:12, Tomas Vondra wrote: > > On 7/4/25 13:05, Jakub Wartak wrote: > >> ... > >> > >> 8. v1-0005 2x + /* if (numa_procs_interleave) */ > >> > >> Ha! it's a TRAP! I've uncommented it because I wanted to try it out > >> without it (just by setting GUC off) , but "MyProc->sema" is NULL : > >> > >> 2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL > >> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit > >> [..] > >> 2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755) > >> was terminated by signal 11: Segmentation fault > >> 2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other > >> active server processes > >> 2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because > >> "restart_after_crash" is off > >> 2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down > >> > >> [New LWP 28755] > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". > >> Core was generated by `postgres: io worker '. > >> Program terminated with signal SIGSEGV, Segmentation fault. > >> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) > >> at ./nptl/sem_waitcommon.c:136 > >> 136 ./nptl/sem_waitcommon.c: No such file or directory. 
> >> (gdb) where > >> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) > >> at ./nptl/sem_waitcommon.c:136 > >> #1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81 > >> #2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at > >> ../src/backend/port/posix_sema.c:302 > >> #3 0x0000556191970553 in InitAuxiliaryProcess () at > >> ../src/backend/storage/lmgr/proc.c:992 > >> #4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at > >> ../src/backend/postmaster/auxprocess.c:65 > >> #5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized > >> out>, startup_data_len=<optimized out>) at > >> ../src/backend/storage/aio/method_worker.c:393 > >> #6 0x00005561918e8163 in postmaster_child_launch > >> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086, > >> startup_data=startup_data@entry=0x0, > >> startup_data_len=startup_data_len@entry=0, > >> client_sock=client_sock@entry=0x0) at > >> ../src/backend/postmaster/launch_backend.c:290 > >> #7 0x00005561918ea09a in StartChildProcess > >> (type=type@entry=B_IO_WORKER) at > >> ../src/backend/postmaster/postmaster.c:3973 > >> #8 0x00005561918ea308 in maybe_adjust_io_workers () at > >> ../src/backend/postmaster/postmaster.c:4404 > >> [..] > >> (gdb) print *MyProc->sem > >> Cannot access memory at address 0x0 > >> > > > > Yeah, good catch. I'll look into that next week. > > > > I've been unable to reproduce this issue, but I'm not sure what settings > you actually used for this instance. Can you give me more details how to > reproduce this? Better late than never - feel free to partially ignore me, I've missed that it is a known issue as per the FIXME there, but I would just rip out that commented out `if(numa_proc_interleave)` from FastPathLockShmemSize() and PGProcShmemSize() unless you want to save those memory pages of course (in case of no-NUMA). If you do want to save those pages I think we have a problem. For the complete picture, steps: 1.
patch -p1 < v2-0001-NUMA-interleaving-buffers.patch 2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch BTW the pgbench accidental indent is still there (part of v2-0001 patch)) 14 out of 14 hunks FAILED -- saving rejects to file src/bin/pgbench/pgbench.c.rej 3. As I'm just applying 0001 and 0006, I've got two simple rejects, but fixed them (due to not applying the missing numa_ freelist patches). That's intentional on my part, because I wanted to play just with those two. 4. Then I uncommented those two "if (numa_procs_interleave)" related to optional shm memory initialization - add_size() and so on (those have the XXX comment above about causing bootstrap issues) 5. initdb with numa_procs_interleave=on, huge_pages = on (!), start, it is ok 6. restart with numa_procs_interleave=off, which gets me every bg worker crashing e.g.: (gdb) where #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at ./nptl/sem_waitcommon.c:136 #1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81 #2 0x0000563e2d6e4d5c in PGSemaphoreReset (sema=0x0) at ../src/backend/port/posix_sema.c:302 #3 0x0000563e2d774d93 in InitAuxiliaryProcess () at ../src/backend/storage/lmgr/proc.c:995 #4 0x0000563e2d6e9252 in AuxiliaryProcessMainCommon () at ../src/backend/postmaster/auxprocess.c:65 #5 0x0000563e2d6eb683 in CheckpointerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/checkpointer.c:190 #6 0x0000563e2d6ec363 in postmaster_child_launch (child_type=child_type@entry=B_CHECKPOINTER, child_slot=249, startup_data=startup_data@entry=0x0, startup_data_len=startup_data_len@entry=0, client_sock=client_sock@entry=0x0) at ../src/backend/postmaster/launch_backend.c:290 #7 0x0000563e2d6ee29a in StartChildProcess (type=type@entry=B_CHECKPOINTER) at ../src/backend/postmaster/postmaster.c:3973 #8 0x0000563e2d6f17a6 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x563e377cc0e0) at ../src/backend/postmaster/postmaster.c:1386 #9
0x0000563e2d4948fc in main (argc=3, argv=0x563e377cc0e0) at ../src/backend/main/main.c:231 notice sema=0x0, because: #3 0x000056050928cd93 in InitAuxiliaryProcess () at ../src/backend/storage/lmgr/proc.c:995 995 PGSemaphoreReset(MyProc->sem); (gdb) print MyProc $1 = (PGPROC *) 0x7f09a0c013b0 (gdb) print MyProc->sem $2 = (PGSemaphore) 0x0 or with printfs: 2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal PGPROC=0x7f9de827b880 requestSize=148770 // after proc && ptr manipulation: 2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal PGPROC=0x7f9de827bdf0 requestSize=148770 procs=0x7f9de827b880 ptr=0x7f9de827bdf0 [..initialization of aux PGPROCs i=0.., still from InitProcGlobal(), each gets a proper sem allocated as one would expect:] [..for i loop:] 2025-07-25 11:17:23.689 CEST [21772] LOG: i=136 , proc=0x7f9de8600000, proc->sem=0x7f9da4e04438 2025-07-25 11:17:23.689 CEST [21772] LOG: i=137 , proc=0x7f9de8600348, proc->sem=0x7f9da4e044b8 2025-07-25 11:17:23.689 CEST [21772] LOG: i=138 , proc=0x7f9de8600690, proc->sem=0x7f9da4e04538 [..but then in the children codepaths, out of the blue in InitAuxiliaryProcess, the whole MyProc looks like it was memset to zeros:] 2025-07-25 11:17:23.693 CEST [21784] LOG: auxiliary process using MyProc=0x7f9de8600000 auxproc=0x7f9de8600000 proctype=0 MyProcPid=21784 MyProc->sem=(nil) above got pgproc slot i=136 with addr 0x7f9de8600000 and later that auxiliary process is launched, but somehow something NULLified ->sem there (according to gdb, everything is zero there) 7.
The original patch v2-0006 (with the 2x "if numa_procs_interleave" commented out) behaves OK, so in my case here with 1x NUMA node that gives add_size(.., 1+1 * 2MB)=4MB 2025-07-25 11:38:54.131 CEST [23939] LOG: in InitProcGlobal PGPROC=0x7f25cbe7b880 requestSize=4343074 2025-07-25 11:38:54.132 CEST [23939] LOG: in InitProcGlobal PGPROC=0x7f25cbe7bdf0 requestSize=4343074 procs=0x7f25cbe7b880 ptr=0x7f25cbe7bdf0 so something is zeroing out all those MyProc structures on startup (probably due to some wrong alignment somewhere?). I was thinking about trapping this single i=136 0x7f9de8600000 PGPROC via mprotect() to see what is resetting it, but oh well, mprotect() works only on whole pages... -J.
On 7/25/25 12:27, Jakub Wartak wrote: > On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> On 7/4/25 20:12, Tomas Vondra wrote: >>> On 7/4/25 13:05, Jakub Wartak wrote: >>>> ... >>>> >>>> 8. v1-0005 2x + /* if (numa_procs_interleave) */ >>>> >>>> Ha! it's a TRAP! I've uncommented it because I wanted to try it out >>>> without it (just by setting GUC off) , but "MyProc->sema" is NULL : >>>> >>>> 2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL >>>> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit >>>> [..] >>>> 2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755) >>>> was terminated by signal 11: Segmentation fault >>>> 2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other >>>> active server processes >>>> 2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because >>>> "restart_after_crash" is off >>>> 2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down >>>> >>>> [New LWP 28755] >>>> [Thread debugging using libthread_db enabled] >>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". >>>> Core was generated by `postgres: io worker '. >>>> Program terminated with signal SIGSEGV, Segmentation fault. >>>> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) >>>> at ./nptl/sem_waitcommon.c:136 >>>> 136 ./nptl/sem_waitcommon.c: No such file or directory. 
>>>> (gdb) where >>>> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) >>>> at ./nptl/sem_waitcommon.c:136 >>>> #1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81 >>>> #2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at >>>> ../src/backend/port/posix_sema.c:302 >>>> #3 0x0000556191970553 in InitAuxiliaryProcess () at >>>> ../src/backend/storage/lmgr/proc.c:992 >>>> #4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at >>>> ../src/backend/postmaster/auxprocess.c:65 >>>> #5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized >>>> out>, startup_data_len=<optimized out>) at >>>> ../src/backend/storage/aio/method_worker.c:393 >>>> #6 0x00005561918e8163 in postmaster_child_launch >>>> (child_type=child_type@entry=B_IO_WORKER, child_slot=20086, >>>> startup_data=startup_data@entry=0x0, >>>> startup_data_len=startup_data_len@entry=0, >>>> client_sock=client_sock@entry=0x0) at >>>> ../src/backend/postmaster/launch_backend.c:290 >>>> #7 0x00005561918ea09a in StartChildProcess >>>> (type=type@entry=B_IO_WORKER) at >>>> ../src/backend/postmaster/postmaster.c:3973 >>>> #8 0x00005561918ea308 in maybe_adjust_io_workers () at >>>> ../src/backend/postmaster/postmaster.c:4404 >>>> [..] >>>> (gdb) print *MyProc->sem >>>> Cannot access memory at address 0x0 >>>> >>> >>> Yeah, good catch. I'll look into that next week. >>> >> >> I've been unable to reproduce this issue, but I'm not sure what settings >> you actually used for this instance. Can you give me more details how to >> reproduce this? > > Better late than never, well feel free to partially ignore me, i've > missed that it is known issue as per FIXME there, but I would just rip > out that commented out `if(numa_proc_interleave)` from > FastPathLockShmemSize() and PGProcShmemSize() unless you want to save > those memory pages of course (in case of no-NUMA). If you do want to > save those pages I think we have problem: > > For complete picture, steps: > > 1. 
patch -p1 < v2-0001-NUMA-interleaving-buffers.patch > 2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch > > BTW the pgbench accidentinal ident is still there (part of v2-0001 patch)) > 14 out of 14 hunks FAILED -- saving rejects to file > src/bin/pgbench/pgbench.c.rej > > 3. As I'm just applying 0001 and 0006, I've got two simple rejects, > but fixed it (due to not applying missing numa_ freelist patches). > That's intentional on my part, because I wanted to play just with > those two. > > 4. Then I uncomment those two "if (numa_procs_interleave)" related for > optional memory shm initialization - add_size() and so on (that have > XXX comment above that it is causing bootstrap issues) > Ah, I didn't realize you uncommented these "if" conditions. In that case the crash is not very surprising, because the actual initialization in InitProcGlobal ignores the GUCs and just assumes it's enabled. But without the extra padding that likely messes up something. Or something allocated later "overwrites" some of the memory. I need to clean this up, to actually consider the GUC properly. FWIW I do have a new patch version that I plan to share in a day or two, once I get some numbers. It didn't change this particular part, though, it's more about the buffers/freelists/clocksweep. I'll work on PGPROC next, I think. regards -- Tomas Vondra
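The failure mode described above - the shared memory size estimate honoring the GUC while InitProcGlobal ignores it - boils down to the estimate and the initialization disagreeing on the layout. A toy sketch (all names and sizes invented, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPROCS    4      /* pretend number of PGPROC slots */
#define PROCSZ    64     /* pretend sizeof(PGPROC) */
#define NUMA_PAD  128    /* pretend per-node alignment padding */

/* Size estimate: honors the GUC (the uncommented "if"). */
static size_t
pgproc_shmem_size(bool numa_procs_interleave)
{
    size_t size = NPROCS * PROCSZ;

    if (numa_procs_interleave)
        size += NUMA_PAD;
    return size;
}

/* Initialization: lays out the padded form unconditionally, like the
 * WIP InitProcGlobal. With the GUC off, this exceeds the reservation,
 * so later allocations land on top of the PGPROC array - which is one
 * way the zeroed-out MyProc (and the NULL sem) could come about. */
static size_t
pgproc_shmem_used(void)
{
    return NPROCS * PROCSZ + NUMA_PAD;
}
```

The invariant to restore is simply that both functions branch on the same GUC value.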
Hi, Here's a somewhat cleaned up v3 of this patch series, with various improvements and a lot of cleanup. Still WIP, but I hope it resolves the various crashes reported for v2. It still requires --with-libnuma (it won't build without it). I'm aware there's an ongoing discussion about removing the freelists, and changing the clocksweep in some way. If that happens, the relevant parts of this series will need some adjustment, of course. I haven't looked into that yet, I plan to review those patches soon.

main changes in v3
------------------

1) I've introduced a "registry" of the buffer partitions (imagine a small array of structs), serving as a source of truth for places that need info about the partitions (range of buffers, ...). With v2 there was no "shared definition" - the shared buffers, freelist and clocksweep did their own thing. But per the discussion it doesn't really make much sense to partition buffers in different ways. So in v3 the 0001 patch defines the partitions, records them in shared memory (in a small array), and the later parts just reuse this. I also added a pg_buffercache_partitions() listing the partitions, with first/last buffer, etc. The freelist/clocksweep patches add additional information.

2) The PGPROC part introduces a similar registry, even though there are no other patches building on this. But it seemed useful to have a clear place recording this info. There's also a view pg_buffercache_pgproc. The pg_buffercache location is a bit bogus - it has nothing to do with buffers, but it was good enough for now.

3) The PGPROC partitioning is reworked and should fix the crash with the GUC set to "off".

4) This still doesn't do anything about "balancing" the clocksweep. I have some ideas how to do that, I'll work on that next.

simple benchmark
----------------

I did a simple benchmark, measuring pgbench throughput with a scale still fitting into RAM, but much larger (~2x) than shared buffers.
See the attached test script, testing builds with more and more of the patches. I'm attaching results from two different machines (the "usual" 2P xeon and also a much larger cloud instance with EPYC/Genoa) - both the raw CSV files, with average tps and percentiles, and PDFs. The PDFs also have a comparison either to the "preceding" build (right side), or to master (below the table). There's results for the three "pgbench pinning" strategies, and that can have pretty significant impact (colocated generally performs much better than either "none" or "random"). For the "bigger" machine (with 176 cores) the incremental results look like this (for pinning=none, i.e. regular pgbench):

mode      s_b   buffers localal no-tail freelist  sweep pgproc pinning
====================================================================
prepared  16GB      99%    101%    100%     103%   111%    99%    102%
          32GB      98%    102%     99%     103%   107%   101%    112%
           8GB      97%    102%    100%     102%   101%   101%    106%
--------------------------------------------------------------------
simple    16GB     100%    100%     99%     105%   108%    99%    108%
          32GB      98%    101%    100%     103%   100%   101%     97%
           8GB     100%    100%    101%      99%   100%   104%    104%

The way I read this is that the first three patches have about no impact on throughput. Then freelist partitioning and (especially) clocksweep partitioning can help quite a bit. pgproc is again close to ~0%, and PGPROC pinning can help again (but this part is merely experimental). For the xeon the differences (in either direction) are much smaller, so I'm not going to post it here. It's in the PDF, though.

I think this looks reasonable. The way I see this patch series is not about improving peak throughput, but more about reducing imbalance and making the behavior more consistent. The results are more a confirmation there's not some sort of massive overhead, somewhere. But I'll get to this in a minute. To quantify this kind of improvement, I think we'll need tests that intentionally cause (or try to) imbalance.
If you have ideas for such tests, let me know.

overhead of partitioning calculation
------------------------------------

Regarding the "overhead", while the results look mostly OK, I think we'll need to rethink the partitioning scheme - particularly how the partition size is calculated. The current scheme has to use %, which can be somewhat expensive.

The 0001 patch calculates a "chunk size", which is the smallest number of buffers it can "assign" to a NUMA node. This depends on how many buffer descriptors fit onto a single memory page, and it's either 512KB (with 4KB pages), or 256MB (with 2MB huge pages). And then each NUMA node gets multiple chunks, to cover shared_buffers/num_nodes. But this can be an arbitrary number - it minimizes the imbalance, but it also forces the use of % and / in the formulas. AFAIK if we required the partitions to be 2^k multiples of the chunk size, we could switch to using shifts and masking, which is supposed to be much faster. But I haven't measured this, and the cost is that some of the nodes could get much less memory. Maybe that's fine.

reserving number of huge pages
------------------------------

The other thing I realized is that partitioning buffers with huge pages is quite tricky, and can easily lead to SIGBUS when accessing the memory later. The crashes I saw happen like this:

1) figure out the # of pages needed (using shared_memory_size_in_huge_pages) - this can be 16828 for shared_buffers=32GB

2) make sure there's enough huge pages: echo 16828 > /proc/sys/vm/nr_hugepages

3) start postgres - everything seems to work just fine

4) query pg_buffercache_numa - triggers SIGBUS accessing memory for a valid buffer (usually ~2GB from the end)

It took me ages to realize what's happening, but it's very simple. The nr_hugepages is a global limit, but it's also translated into limits for each NUMA node. So when you write 16828 to it, in a 4-node system each node gets 1/4 of that.
See:

$ numastat -cm

Then we do the mmap(), and everything looks great, because there really is enough huge pages and the system can allocate memory from any NUMA node it needs. And then we come around, and do the numa_tonode_memory(). And that's where the issues start, because AFAIK this does not check the per-node limit of huge pages in any way. It just appears to work. And then later, when we finally touch the buffer, it tries to actually allocate the memory on the node, and realizes there's not enough huge pages. And triggers the SIGBUS.

You may ask why the per-node limit is too low. We still need just shared_memory_size_in_huge_pages, right? And if we were partitioning the whole memory segment, that'd be true. But we only do that for shared buffers, and there's a lot of other shared memory - could be 1-2GB or so, depending on the configuration. And this gets placed on one of the nodes, and it counts against the limit on that particular node. And so it doesn't have enough huge pages to back the partition of shared buffers.

The only way around this I found is by inflating the number of huge pages, significantly above the shared_memory_size_in_huge_pages value. Just to make sure the nodes get enough huge pages. I don't know what to do about this. It's quite annoying. If we only used huge pages for the partitioned parts, this wouldn't be a problem.

I also realize this can be used to make sure the memory is balanced on NUMA systems. Because if you set nr_hugepages, the kernel will ensure the shared memory is distributed on all the nodes. It won't have the benefits of "coordinating" the buffers and buffer descriptors, and so on. But it will be balanced.

regards -- Tomas Vondra
Attachments
- v3-0007-NUMA-pin-backends-to-NUMA-nodes.patch
- v3-0006-NUMA-interleave-PGPROC-entries.patch
- v3-0005-NUMA-clockweep-partitioning.patch
- v3-0004-NUMA-partition-buffer-freelist.patch
- v3-0003-freelist-Don-t-track-tail-of-a-freelist.patch
- v3-0002-NUMA-localalloc.patch
- v3-0001-NUMA-interleaving-buffers.patch
- numa-hb176.csv
- run-huge-pages.sh
- numa-xeon.csv
- numa-xeon-e5-2699.pdf
- numa-hb176.pdf
On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas@vondra.me> wrote: Hi Tomas, just a quick look here: > 2) The PGPROC part introduces a similar registry, [..] > > There's also a view pg_buffercache_pgproc. The pg_buffercache location > is a bit bogus - it has nothing to do with buffers, but it was good > enough for now. If you are looking for better names: pg_shmem_pgproc_numa would sound like a more natural name. > 3) The PGPROC partitioning is reworked and should fix the crash with the > GUC set to "off". Thanks! > simple benchmark > ---------------- [..] > There's results for the three "pgbench pinning" strategies, and that can > have pretty significant impact (colocated generally performs much better > than either "none" or "random"). Hint: the real world is that network cards are usually located on some PCI slot that is assigned to a certain node (so traffic is flowing from/to there), so probably it would make some sense to put pgbench outside this machine and remove this as a "variable" anyway, and remove the need for that pgbench --pin-cpus in the script. In optimal conditions: the most optimized layout would probably be to have 2 cards on separate PCI slots, each for a different node, and some LACP between those, with xmit_hash_policy allowing traffic distribution on both of those cards -- usually there's not just a single IP/MAC out there talking to/from such a server, so that would be real-world (or lack of) affinity. Also the classic pgbench workload seems to be a poor fit for testing it out (at least v3-0001 buffers); there I would propose sticking to just lots of big (~s_b size) full table seq scans to put stress on shared memory. Classic pgbench is usually not enough to put serious bandwidth on the interconnect by my measurements. > For the "bigger" machine (wiuth 176 cores) the incremental results look > like this (for pinning=none, i.e.
regular pgbench):
>
>
> mode      s_b   buffers localal no-tail freelist  sweep pgproc pinning
> ====================================================================
> prepared  16GB      99%    101%    100%     103%   111%    99%    102%
>           32GB      98%    102%     99%     103%   107%   101%    112%
>            8GB      97%    102%    100%     102%   101%   101%    106%
> --------------------------------------------------------------------
> simple    16GB     100%    100%     99%     105%   108%    99%    108%
>           32GB      98%    101%    100%     103%   100%   101%     97%
>            8GB     100%    100%    101%      99%   100%   104%    104%
>
> The way I read this is that the first three patches have about no impact
> on throughput. Then freelist partitioning and (especially) clocksweep
> partitioning can help quite a bit. pgproc is again close to ~0%, and
> PGPROC pinning can help again (but this part is merely experimental).

Isn't the "pinning" column representing just numa_procs_pin=on? (shouldn't it be tested with numa_procs_interleave = on?)

[..]
> To quantify this kind of improvement, I think we'll need tests that
> intentionally cause (or try to) imbalance. If you have ideas for such
> tests, let me know.

Some ideas:
1. concurrent seq scans hitting an s_b-sized table
2. one single giant PX-enabled seq scan with $VCPU workers (stresses the importance of interleaving dynamic shm for workers)
3. select txid_current() with -M prepared?

> reserving number of huge pages
> ------------------------------
[..]
> It took me ages to realize what's happening, but it's very simple. The
> nr_hugepages is a global limit, but it's also translated into limits for
> each NUMA node. So when you write 16828 to it, in a 4-node system each
> node gets 1/4 of that. See
>
> $ numastat -cm
>
> Then we do the mmap(), and everything looks great, because there really
> is enough huge pages and the system can allocate memory from any NUMA
> node it needs.

Yup, similar story as with OOMs just for per-zone/node.

> And then we come around, and do the numa_tonode_memory().
And that's
> where the issues start, because AFAIK this does not check the per-node
> limit of huge pages in any way. It just appears to work. And then later,
> when we finally touch the buffer, it tries to actually allocate the
> memory on the node, and realizes there's not enough huge pages. And
> triggers the SIGBUS.

I think that's why options for strict policy numa allocation exist, and I had the option to use it in my patches (anyway, with one big call to numa_interleave_memory() for everything it was much simpler and just not micromanaging things). Good reads are numa(3), but e.g. mbind(2) underneath will tell you that e.g. `Before Linux 5.7. MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x, but it could be happening for you too if you start using it). Anyway, numa_set_strict() is just a wrapper around setting this exact flag.

Anyway, remember that volatile pg_numa_touch_mem_if_required()? Maybe that should always be called in your patch series to pre-populate everything during startup, so that others testing will get a proper guaranteed layout, even without issuing any pg_buffercache calls.

> The only way around this I found is by inflating the number of huge
> pages, significantly above the shared_memory_size_in_huge_pages value.
> Just to make sure the nodes get enough huge pages.
>
> I don't know what to do about this. It's quite annoying. If we only used
> huge pages for the partitioned parts, this wouldn't be a problem.

Meh, sacrificing a couple of huge pages (worst-case 1GB?) just to get NUMA affinity seems like a logical trade-off, doesn't it? But postgres -C shared_memory_size_in_huge_pages still works OK to establish the exact count for vm.nr_hugepages, right?

Regards, -J.
On 7/30/25 10:29, Jakub Wartak wrote: > On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi Tomas, > > just a quick look here: > >> 2) The PGPROC part introduces a similar registry, [..] >> >> There's also a view pg_buffercache_pgproc. The pg_buffercache location >> is a bit bogus - it has nothing to do with buffers, but it was good >> enough for now. > > If you are looking for better names: pg_shmem_pgproc_numa would sound > like a more natural name. > >> 3) The PGPROC partitioning is reworked and should fix the crash with the >> GUC set to "off". > > Thanks! > >> simple benchmark >> ---------------- > [..] >> There's results for the three "pgbench pinning" strategies, and that can >> have pretty significant impact (colocated generally performs much better >> than either "none" or "random"). > > Hint: real world is that network cards are usually located on some PCI > slot that is assigned to certain node (so traffic is flowing from/to > there), so probably it would make some sense to put pgbench outside > this machine and remove this as "variable" anyway and remove the need > for that pgbench --pin-cpus in script. In optimal conditions: most > optimized layout would be probably to have 2 cards on separate PCI > slots, each for different node and some LACP between those, with > xmit_hash_policy allowing traffic distribution on both of those cards > -- usually there's not just single IP/MAC out there talking to/from > such server, so that would be real-world (or lack of) affinity. > The pgbench pinning certainly reduces some of the noise / overhead you get when using multiple machines. I use it to "isolate" patches, and make the effects more visible. > Also classic pgbench workload, seems to be poor fit for testing it out > (at least v3-0001 buffers), there I would propose sticking to just > lots of big (~s_b size) full table seq scans to put stress on shared > memory. 
Classic pgbench is usually not there enough to put serious
> bandwidth on the interconnect by my measurements.

Yes, that's possible. The simple pgbench workload is a bit of a "worst case" for the NUMA patches, in that it can benefit less from the improvements, and it's also fairly sensitive to regressions. I plan to do more tests with other types of workloads, like the one doing a lot of large sequential scans, etc.

>> For the "bigger" machine (wiuth 176 cores) the incremental results look
>> like this (for pinning=none, i.e. regular pgbench):
>>
>>
>> mode      s_b   buffers localal no-tail freelist  sweep pgproc pinning
>> ====================================================================
>> prepared  16GB      99%    101%    100%     103%   111%    99%    102%
>>           32GB      98%    102%     99%     103%   107%   101%    112%
>>            8GB      97%    102%    100%     102%   101%   101%    106%
>> --------------------------------------------------------------------
>> simple    16GB     100%    100%     99%     105%   108%    99%    108%
>>           32GB      98%    101%    100%     103%   100%   101%     97%
>>            8GB     100%    100%    101%      99%   100%   104%    104%
>>
>> The way I read this is that the first three patches have about no impact
>> on throughput. Then freelist partitioning and (especially) clocksweep
>> partitioning can help quite a bit. pgproc is again close to ~0%, and
>> PGPROC pinning can help again (but this part is merely experimental).
>
> Isn't the "pinning" column representing just numa_procs_pin=on ?
> (shouldn't it be tested with numa_procs_interleave = on?)

Maybe I don't understand the question, but the last column (pinning) compares two builds:

1) Build with all the patches up to "pgproc interleaving" (and all of the GUCs set to "on").

2) Build with all the patches from (1), and "pinning" too (again, all GUCs set to "on").

Or do I misunderstand the question?
concurrent seq scans hitting s_b-sized table > 2. one single giant PX-enabled seq scan with $VCPU workers (stresses > the importance of interleaving dynamic shm for workers) > 3. select txid_current() with -M prepared? > Thanks. I think we'll try something like (1), but it'll need to be a bit more elaborate, because scans on tables larger than 1/4 shared buffers use a small circular buffer. >> reserving number of huge pages >> ------------------------------ > [..] >> It took me ages to realize what's happening, but it's very simple. The >> nr_hugepages is a global limit, but it's also translated into limits for >> each NUMA node. So when you write 16828 to it, in a 4-node system each >> node gets 1/4 of that. See >> >> $ numastat -cm >> >> Then we do the mmap(), and everything looks great, because there really >> is enough huge pages and the system can allocate memory from any NUMA >> node it needs. > > Yup, similiar story as with OOMs just for per-zone/node. > >> And then we come around, and do the numa_tonode_memory(). And that's >> where the issues start, because AFAIK this does not check the per-node >> limit of huge pages in any way. It just appears to work. And then later, >> when we finally touch the buffer, it tries to actually allocate the >> memory on the node, and realizes there's not enough huge pages. And >> triggers the SIGBUS. > > I think that's why options for strict policy numa allocation exist and > I had the option to use it in my patches (anyway with one big call to > numa_interleave_memory() for everything it was much more simpler and > just not micromanaging things). Good reads are numa(3) but e.g. > mbind(2) underneath will tell you that e.g. `Before Linux 5.7. > MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x, > but it could be happening for you too if you start using it). Anyway, > numa_set_strict() is just wrapper around setting this exact flag > > Anyway remember that volatile pg_numa_touch_mem_if_required()? 
- maybe > that should be always called in your patch series to pre-populate > everything during startup, so that others testing will get proper > guaranteed layout, even without issuing any pg_buffercache calls. > I think I tried using numa_set_strict, but it didn't change the behavior (i.e. the numa_tonode_memory didn't error out). >> The only way around this I found is by inflating the number of huge >> pages, significantly above the shared_memory_size_in_huge_pages value. >> Just to make sure the nodes get enough huge pages. >> >> I don't know what to do about this. It's quite annoying. If we only used >> huge pages for the partitioned parts, this wouldn't be a problem. > > Meh, sacrificing a couple of huge pages (worst-case 1GB ?) just to get > NUMA affinity, seems like a logical trade-off, doesn't it? > But postgres -C shared_memory_size_in_huge_pages still works OK to > establish the exact count for vm.nr_hugepages, right? > Well, yes and no. It tells you the exact number of huge pages, but it does not tell you how much you need to inflate it to account for the non-shared buffer part that may get allocated on a random node. regards -- Tomas Vondra
Hi,

Here's an updated version of the patch series. The main improvement is the new 0006 patch, adding "adaptive balancing" of allocations. I'll also share some results from a workload doing a lot of allocations.

adaptive balancing of allocations
---------------------------------

Imagine each backend only allocates buffers from the partition on the same NUMA node. E.g. you have 4 NUMA nodes (i.e. 4 partitions), and a backend only allocates buffers from its "home" partition (on the same NUMA node). This is what the earlier patch versions did, and with many backends that's mostly fine (assuming the backends get spread over all the NUMA nodes). But if there's only a few backends doing the allocations, this can result in very inefficient use of shared buffers - a single backend would be limited to 25% of buffers, even if the rest is unused. There needs to be some way to "redirect" excess allocations to other partitions, so that the partitions are utilized about the same. This is what the 0006 patch aims to do (I kept it separate, but it should probably get merged into the "clocksweep partitioning" in the end).

The balancing is fairly simple:

(1) It tracks the number of allocations "requested" from each partition.

(2) In regular intervals (by bgwriter), calculate the "fair share" per partition, and determine what fraction of "requests" to handle from the partition itself, and how many to redirect to other partitions.

(3) Calculate coefficients to drive this for each partition.

I emphasize (1) talks about "requests", not the actual allocations. Some of the requests could have been redirected to different partitions, and be counted as allocations there. We want to balance allocations, but it relies on the requests. To give you a simple example - imagine there are 2 partitions with this number of allocation requests:

P1: 900,000 requests
P2: 100,000 requests

This means the "fair share" is 500,000 allocations, so P1 needs to redirect some requests to P2.
And we end with these weights:

P1: [ 55, 45]
P2: [ 0, 100]

Assuming the workload does not shift in some dramatic way, this should result in both partitions handling ~500k allocations. It's not hard to extend this algorithm to more partitions. For more details see StrategySyncBalance(), which recalculates this.

There are a couple of open questions, like:

* The algorithm combines the old/new weights by averaging, to add a bit of hysteresis. Right now it's a simple average with 0.5 weight, to dampen sudden changes. I think it works fine (in the long run), but I'm open to suggestions how to do this better.

* There's probably additional things we should consider when deciding where to redirect the allocations. For example, we may have multiple partitions per NUMA node, in which case it's better to redirect to that node as many allocations as possible. The current patch ignores this.

* The partitions may have slightly different sizes, but the balancing ignores that for now. This is not very difficult to address.

clocksweep benchmark
--------------------

I ran a simple benchmark focused on allocation-heavy workloads, namely large concurrent sequential scans. The attached scripts generate a number of 1GB tables, and then run concurrent sequential scans with shared buffers set to 60%, 75%, 90% and 110% of the total dataset size. I did this for master, and with the NUMA patches applied (and the GUCs set to 'on'). I also tried with the number of partitions increased to 16 (so a NUMA node got multiple partitions). There are results from three machines:

1) ryzen - small non-NUMA system, mostly to see if there's regressions
2) xeon - older 2-node NUMA system
3) hb176 - big EPYC system with 176 cores / 4 NUMA nodes

The script records detailed TPS stats (e.g. percentiles), I'm attaching CSV files with complete results, and some PDFs with charts summarizing that (I'll get to that in a minute).
For the EPYC, the average tps for the three builds looks like this:

 clients | master   numa   numa-16 |  numa   numa-16
---------|--------------------------|-----------------
       8 |     20     27      26   |  133%    129%
      16 |     23     39      45   |  170%    193%
      24 |     23     48      58   |  211%    252%
      32 |     21     57      68   |  268%    321%
      40 |     21     56      76   |  265%    363%
      48 |     22     59      82   |  270%    375%
      56 |     22     66      88   |  296%    397%
      64 |     23     62      93   |  277%    411%
      72 |     24     68      95   |  277%    389%
      80 |     24     72      95   |  295%    391%
      88 |     25     71      98   |  283%    392%
      96 |     26     74      97   |  282%    369%
     104 |     26     74      97   |  282%    367%
     112 |     27     77      95   |  287%    355%
     120 |     28     77      92   |  279%    335%
     128 |     27     75      89   |  277%    328%

That's not bad - the clocksweep partitioning increases the throughput 2-3x. Having 16 partitions (instead of 4) helps yet a bit more, to 3-4x. This is for shared buffers set to 60% of the dataset, which depends on the number of clients / tables. With 64 clients/tables, there's 64GB of data, and shared buffers are set to ~39GB. The results for 75% and 90% follow the same pattern. For 110% there's much less impact - there are no allocations, so this has to be thanks to the other NUMA patches.

The charts in the attached PDFs add a bit more detail, with various percentiles (of per-second throughput). The bands are roughly quartiles: 5-25%, 25-50%, 50-75%, 75-95%. The thick middle line is the median. There's only charts for 60%, 90% and 110% shared buffers, to fit it on a single page. The 75% case is not very different.

For ryzen there's little difference. Not surprising, it's not a NUMA system. So this is a positive result, as there's no regression. For xeon the patches help a little bit. Again, not surprising. It's a fairly old system (~2016), and the differences between NUMA nodes are not that significant. For epyc (hb176), the differences are pretty massive.

regards -- Tomas Vondra
Attachments
- numa-benchmark-ryzen.pdf
- numa-benchmark-epyc.pdf
- numa-benchmark-xeon.pdf
- xeon.csv.gz
- ryzen.csv.gz
- hb176.csv.gz
- run.sh
- generate.sh
- v20250804-0001-NUMA-interleaving-buffers.patch.gz
- v20250804-0008-NUMA-pin-backends-to-NUMA-nodes.patch.gz
- v20250804-0007-NUMA-interleave-PGPROC-entries.patch.gz
- v20250804-0006-NUMA-clocksweep-allocation-balancing.patch.gz
- v20250804-0005-NUMA-clockweep-partitioning.patch.gz
- v20250804-0004-NUMA-partition-buffer-freelist.patch.gz
- v20250804-0003-freelist-Don-t-track-tail-of-a-freelist.patch.gz
- v20250804-0002-NUMA-localalloc.patch.gz
Hi! Here's a slightly improved version of the patch series. The main improvement is related to rebalancing partitions of different sizes (which can happen because the sizes have to be a multiple of some minimal "chunk" determined by memory page size etc.). Part 0009 deals with that by adjusting the allocations by partition size. It works OK, but it's also true it matters less as the shared_buffers size increases (as the relative difference between large/small partition gets smaller). The other improvements are related to the pg_buffercache_partitions view, showing the weights and (running) totals of allocations. I plan to take a break from this patch series for a while, so this would be a good time to take a look, do a review, run some tests etc. ;-)

One detail about the balancing I forgot to mention in my last message is how the patch "distributes" allocations to match the balancing weights. Consider for example the weights from that message:

P1: [ 55, 45]
P2: [ 0, 100]

Imagine a backend located on P1 requests allocation of a buffer. The weights say 55% of buffers should be allocated from P1, and 45% should be redirected to P2. One way to achieve that would be generating a random number in [1, 100], and if it's in [1,55] then P1, otherwise P2. The patch does a much simpler thing - treat the weight as a "budget", i.e. the number of buffers to allocate before proceeding to the "next" partition. So it allocates 55 buffers from P1, then 45 buffers from P2, and then goes back to P1 in a round-robin way. The advantage is it can do without a PRNG.

There's two things I'm not entirely sure about:

1) memory model - I'm not quite sure the current code ensures updates to weights are properly "communicated" to the other processes. That is, if the bgwriter recalculates the weights, will the other backends see the new weights right away? Using stale weights won't cause "failures", the consequence is just a bit of imbalance.
But it shouldn't stay like that for too long, so maybe it'd be good to add some memory barriers or something like that.

2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly assumes each core / piece of RAM is assigned to a particular NUMA node. For the buffer partitioning the patch mostly cares about memory, as it "locates" the buffers on different NUMA nodes. Which works mostly OK (ignoring the issues with huge pages described in the previous message). But it also cares about the cores (and the node for each core), because it uses that to pick the right partition for a backend. And here the situation is less clear, because the CPUs don't need to be assigned to a particular node, even on a NUMA system. Consider the rpi5 NUMA layout:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3
node 0 size: 992 MB
node 0 free: 274 MB
node 1 cpus: 0 1 2 3
node 1 size: 1019 MB
node 1 free: 327 MB
node 2 cpus: 0 1 2 3
node 2 size: 1019 MB
node 2 free: 321 MB
node 3 cpus: 0 1 2 3
node 3 size: 955 MB
node 3 free: 251 MB
node 4 cpus: 0 1 2 3
node 4 size: 1019 MB
node 4 free: 332 MB
node 5 cpus: 0 1 2 3
node 5 size: 1019 MB
node 5 free: 342 MB
node 6 cpus: 0 1 2 3
node 6 size: 1019 MB
node 6 free: 352 MB
node 7 cpus: 0 1 2 3
node 7 size: 1014 MB
node 7 free: 339 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  10  10  10  10  10  10  10
  1:  10  10  10  10  10  10  10  10
  2:  10  10  10  10  10  10  10  10
  3:  10  10  10  10  10  10  10  10
  4:  10  10  10  10  10  10  10  10
  5:  10  10  10  10  10  10  10  10
  6:  10  10  10  10  10  10  10  10
  7:  10  10  10  10  10  10  10  10

This says there are 8 NUMA nodes, each with ~1GB of RAM. But the 4 cores are not assigned to particular nodes - each core is mapped to all 8 NUMA nodes. I'm not sure what to do about this (or how getcpu() or libnuma handle this). And can the situation be even more complicated?

regards -- Tomas Vondra
Attachments
- v20250807-0011-NUMA-pin-backends-to-NUMA-nodes.patch
- v20250807-0010-NUMA-interleave-PGPROC-entries.patch
- v20250807-0009-NUMA-weighted-clocksweep-balancing.patch
- v20250807-0008-NUMA-clocksweep-allocation-balancing.patch
- v20250807-0007-NUMA-clockweep-partitioning.patch
- v20250807-0006-NUMA-partition-buffer-freelist.patch
- v20250807-0005-freelist-Don-t-track-tail-of-a-freelist.patch
- v20250807-0004-NUMA-localalloc.patch
- v20250807-0003-NUMA-interleaving-buffers.patch
- v20250807-0002-nbtree-Use-ReadRecentBuffer-in_bt_getroot.patch
- v20250807-0001-allow-pgbench-to-pin-backends-threads.patch
On 8/7/25 11:24, Tomas Vondra wrote: > Hi! > > Here's a slightly improved version of the patch series. > Ah, I made a mistake when generating the patches. The 0001 and 0002 patches are not part of the NUMA stuff, it's just something related to benchmarking (addressing unrelated bottlenecks etc.). The actual NUMA patches start with 0003. Also, 0007, 0008 and 0009 should ultimately be collapsed into a single patch. It's all about the clocksweep partitioning, I only kept those separate to make it easier to see the changes and review. regards -- Tomas Vondra
Hi, On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:

> 2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly
> assumes each core / piece of RAM is assigned to a particular NUMA node.

There are systems in which some NUMA nodes do *not* contain any CPUs. E.g. if you attach memory via a CXL/PCIe add-in card, rather than via the CPU's memory controller. In that case numactl -H (and obviously also the libnuma APIs) will report that the numa node is not associated with any CPU. I don't currently have live access to such a system, but this PR piece happens to have numactl -H output: https://lenovopress.lenovo.com/lp2184-implementing-cxl-memory-on-linux-on-thinksystem-v4-servers

> numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
> node 0 size: 1031904 MB
> node 0 free: 1025554 MB
> node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
> node 1 size: 1032105 MB
> node 1 free: 1024244 MB
> node 2 cpus:
> node 2 size: 262144 MB
> node 2 free: 262143 MB
> node 3 cpus:
> node 3 size: 262144 MB
> node 3 free: 262142 MB
> node distances:
> node   0   1   2   3
>   0:  10  21  14  24
>   1:  21  10  24  14
>   2:  14  24  10  26
>   3:  24  14  26  10

Note that node 2 & 3 don't have associated CPUs (and higher access costs). I don't think this is common enough to worry about from a performance POV, but we probably shouldn't crash if we encounter it...
> But it also cares about the cores (and the node for each core), because
> it uses that to pick the right partition for a backend. And here the
> situation is less clear, because the CPUs don't need to be assigned to a
> particular node, even on a NUMA system. Consider the rpi5 NUMA layout:
>
> $ numactl --hardware
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3
> node 0 size: 992 MB
> node 0 free: 274 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 1019 MB
> node 1 free: 327 MB
> ...
> node   0   1   2   3   4   5   6   7
>   0:  10  10  10  10  10  10  10  10
>   1:  10  10  10  10  10  10  10  10
>   2:  10  10  10  10  10  10  10  10
>   3:  10  10  10  10  10  10  10  10
>   4:  10  10  10  10  10  10  10  10
>   5:  10  10  10  10  10  10  10  10
>   6:  10  10  10  10  10  10  10  10
>   7:  10  10  10  10  10  10  10  10
> This says there are 8 NUMA nodes, each with ~1GB of RAM. But the 4 cores
> are not assigned to particular nodes - each core is mapped to all 8 NUMA
> nodes.

FWIW, you can get a different version of this with AMD Epyc too, if "L3 LLC as NUMA" is enabled.

> I'm not sure what to do about this (or how getcpu() or libnuma handle this).

I don't immediately see any libnuma functions that would care? I also am somewhat curious about what getcpu() returns for the current node...

Greetings, Andres Freund
On 8/9/25 02:25, Andres Freund wrote: > Hi, > > On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote: >> 2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly >> assumes each core / piece of RAM is assigned to a particular NUMA node. > > There are systems in which some NUMA nodes do *not* contain any CPUs. E.g. if > you attach memory via a CXL/PCIe add-in card, rather than via the CPUs memory > controller. In that case numactl -H (and obviously also the libnuma APIs) will > report that the numa node is not associated with any CPU. > > I don't currently have live access to such a system, but this PR piece happens > to have numactl -H output: > https://lenovopress.lenovo.com/lp2184-implementing-cxl-memory-on-linux-on-thinksystem-v4-servers >> numactl -H >> available: 4 nodes (0-3) >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 3839 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 >> node 0 size: 1031904 MB >> node 0 free: 1025554 MB >> node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 8283 84 85 86 87 88 89 90 91 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 >> node 1 size: 1032105 MB >> node 1 free: 1024244 MB >> node 2 cpus: >> node 2 size: 262144 MB >> node 2 free: 262143 MB >> node 3 cpus: >> node 3 size: 262144 MB >> node 3 free: 262142 MB >> node distances: >> node 0 1 2 3 >> 0: 10 21 14 24 >> 1: 21 10 24 14 >> 2: 14 24 10 26 >> 3: 24 14 26 10 > > Note that node 2 & 3 don't have associated CPUs (and higher access costs). 
> > I don't think this is common enough to worry about from a performance POV, but > we probably shouldn't crash if we encounter it... > Right. I don't think the current patch would crash - I can't test it, but I don't see why it would crash. In the worst case it'd end up with partitions that are not ideal. The question is more what would an ideal partitioning for buffers and PGPROC look like. Any opinions? For PGPROC, it's simple - it doesn't make sense to allocate partitions for nodes without CPUs. For buffers, it probably does not really matter. If a node does not have any CPUs, that does not mean we should not put any buffers on it. After all, CXL will never have any CPUs (at least I think that's the case), and not using it for shared buffers would be a bit strange. Although, it could still be used for page cache. Maybe it should be "tiered" a bit more? The patch differentiates only between partitions on "my" NUMA node vs. every other partition. Maybe it should have more layers? That'd make the "balancing" harder. But I wanted to make it smarter when handling cases with multiple partitions per NUMA node. > >> But it also cares about the cores (and the node for each core), because >> it uses that to pick the right partition for a backend. And here the >> situation is less clear, because the CPUs don't need to be assigned to a >> particular node, even on a NUMA system. Consider the rpi5 NUMA layout: >> >> $ numactl --hardware >> available: 8 nodes (0-7) >> node 0 cpus: 0 1 2 3 >> node 0 size: 992 MB >> node 0 free: 274 MB >> node 1 cpus: 0 1 2 3 >> node 1 size: 1019 MB >> node 1 free: 327 MB >> ... >> node 0 1 2 3 4 5 6 7 >> 0: 10 10 10 10 10 10 10 10 >> 1: 10 10 10 10 10 10 10 10 >> 2: 10 10 10 10 10 10 10 10 >> 3: 10 10 10 10 10 10 10 10 >> 4: 10 10 10 10 10 10 10 10 >> 5: 10 10 10 10 10 10 10 10 >> 6: 10 10 10 10 10 10 10 10 >> 7: 10 10 10 10 10 10 10 10 >> This says there are 8 NUMA nodes, each with ~1GB of RAM. 
But the 4 cores >> are not assigned to particular nodes - each core is mapped to all 8 NUMA >> nodes. > > FWIW, you can get a different version of this with AMD Epyc too, if "L3 LLC as > NUMA" is enabled. > > >> I'm not sure what to do about this (or how getcpu() or libnuma handle this). > > I don't immediately see any libnuma functions that would care? > Not sure what "care" means here. I don't think it's necessarily broken, it's more about the APIs not making the situation very clear (or convenient). How do you determine nodes for a CPU, for example? The closest thing I see is numa_node_of_cpu(), but that only returns a single node. Or how would you determine the number of nodes with CPUs (so that we create PGPROC partitions only for those)? I suppose that requires literally walking all the nodes. > I also am somewhat curious about what getcpu() returns for the current node... > It seems it only returns node 0. The cpu changes, but the node does not. regards -- Tomas Vondra
Hi, On 2025-08-12 13:04:07 +0200, Tomas Vondra wrote: > Right. I don't think the current patch would crash - I can't test it, > but I don't see why it would crash. In the worst case it'd end up with > partitions that are not ideal. The question is more what would an ideal > partitioning for buffers and PGPROC look like. Any opinions? > > For PGPROC, it's simple - it doesn't make sense to allocate partitions > for nodes without CPUs. > > For buffers, it probably does not really matter if a node does not have > any CPUs. If a node does not have any CPUs, that does not mean we should > not put any buffers on it. After all, CXL will never have any CPUs (at > least I think that's the case), and not using it for shared buffers > would be a bit strange. Although, it could still be used for page cache. For CXL memory to be really usable, I think we'd need nontrivial additional work. CXL memory has considerably higher latency and lower throughput. You'd *never* want things like BufferDescs or such on such nodes. And even the buffered data itself, you'd want to make sure that frequently used data, e.g. inner index pages, never end up on it. Which leads to: > Maybe it should be "tiered" a bit more? Yes, for proper CXL support, we'd need a component that explicitly demotes and promotes pages from "real" memory to CXL memory and the other way round. The demotion is relatively easy, you'd probably just do it whenever you'd otherwise throw out a victim buffer. When to promote back is harder... > The patch differentiates only between partitions on "my" NUMA node vs. every > other partition. Maybe it should have more layers? Given the relative unavailability of CXL memory systems, I think just not crashing is good enough for now... > >> I'm not sure what to do about this (or how getcpu() or libnuma handle this). > > > > I don't immediately see any libnuma functions that would care? > > > > Not sure what "care" means here. 
I don't think it's necessarily broken, > it's more about the APIs not making the situation very clear (or > convenient). What I mean is that I was looking through the libnuma functions and didn't see any that would be affected by having multiple "local" NUMA nodes. But: > How do you determine nodes for a CPU, for example? The closest thing I > see is numa_node_of_cpu(), but that only returns a single node. Or how > would you determine the number of nodes with CPUs (so that we create > PGPROC partitions only for those)? I suppose that requires literally > walking all the nodes. I didn't think of numa_node_of_cpu(). As long as numa_node_of_cpu() returns *something* I think it may be good enough. Nobody uses an RPi for high-throughput postgres workloads with a lot of memory. Slightly sub-optimal mappings should really not matter. I'm kinda wondering if we should deal with such fake numa systems by detecting them and disabling our numa support. Greetings, Andres Freund
On 8/12/25 16:24, Andres Freund wrote: > Hi, > > On 2025-08-12 13:04:07 +0200, Tomas Vondra wrote: >> Right. I don't think the current patch would crash - I can't test it, >> but I don't see why it would crash. In the worst case it'd end up with >> partitions that are not ideal. The question is more what would an ideal >> partitioning for buffers and PGPROC look like. Any opinions? >> >> For PGPROC, it's simple - it doesn't make sense to allocate partitions >> for nodes without CPUs. >> >> For buffers, it probably does not really matter if a node does not have >> any CPUs. If a node does not have any CPUs, that does not mean we should >> not put any buffers on it. After all, CXL will never have any CPUs (at >> least I think that's the case), and not using it for shared buffers >> would be a bit strange. Although, it could still be used for page cache. > > For CXL memory to be really usable, I think we'd need nontrivial additional > work. CXL memory has considerably higher latency and lower throughput. You'd > *never* want things like BufferDescs or such on such nodes. And even the > buffered data itself, you'd want to make sure that frequently used data, > e.g. inner index pages, never end up on it. > OK, let's keep that out of scope for these patches and assume we're dealing only with local memory. CXL could still be used by the OS for page cache, or whatever. What does that mean for the patch, though? Does it need a way to configure which nodes to use? I argued to leave this to the OS/numactl, and we'd just use whatever is made available to Postgres. But maybe we'll need something within Postgres after all? FWIW there's work needed to actually inherit NUMA info from the OS. Right now the patches just use all NUMA nodes, indexed by 0 ... (N-1) etc. I like the "registry" concept I used for buffer/PGPROC partitions, it made the patches much simpler. Maybe we should use something like that for NUMA info too. 
That is, at startup build a record of the NUMA layout, and use this as the source of truth everywhere (instead of using libnuma from all those places). > Which leads to: > >> Maybe it should be "tiered" a bit more? > > Yes, for proper CXL support, we'd need a component that explicitly demotes and > promotes pages from "real" memory to CXL memory and the other way round. The > demotion is relatively easy, you'd probably just do it whenever you'd > otherwise throw out a victim buffer. When to promote back is harder... > Sounds very much like page cache (but that only works for buffered I/O). > >> The patch differentiates only between partitions on "my" NUMA node vs. every >> other partition. Maybe it should have more layers? > > Given the relative unavailability of CXL memory systems, I think just not > crashing is good enough for now... > The lowest of bars ;-) > >>>> I'm not sure what to do about this (or how getcpu() or libnuma handle this). >>> >>> I don't immediately see any libnuma functions that would care? >>> >> >> Not sure what "care" means here. I don't think it's necessarily broken, >> it's more about the APIs not making the situation very clear (or >> convenient). > > What I mean is that I was looking through the libnuma functions and didn't see > any that would be affected by having multiple "local" NUMA nodes. But: > My question is a bit of a "reverse" to this. That is, how do we even find (with libnuma) there are multiple local nodes? > >> How do you determine nodes for a CPU, for example? The closest thing I >> see is numa_node_of_cpu(), but that only returns a single node. Or how >> would you determine the number of nodes with CPUs (so that we create >> PGPROC partitions only for those)? I suppose that requires literally >> walking all the nodes. > > I didn't think of numa_node_of_cpu(). > Yeah. I think most of the libnuma API is designed for each CPU belonging to a single NUMA node. 
I suppose we'd need to use numa_node_to_cpus() to build this kind of information ourselves. > As long as numa_node_of_cpu() returns *something* I think it may be good > enough. Nobody uses an RPi for high-throughput postgres workloads with a lot > of memory. Slightly sub-optimal mappings should really not matter. > I'm not really concerned about rpi, or the performance on it. I only use it as an example of system with "weird" NUMA layout. > I'm kinda wondering if we should deal with such fake numa systems by detecting > them and disabling our numa support. > That'd be an option too, if we can identify such systems. We could do that while building the "NUMA registry" I mentioned earlier. regards -- Tomas Vondra
Hi, On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote: > The patch does a much simpler thing - treat the weight as a "budget", > i.e. number of buffers to allocate before proceeding to the "next" > partition. So it allocates 55 buffers from P1, then 45 buffers from P2, > and then goes back to P1 in a round-robin way. The advantage is it can > do away without a PRNG. I think that's a good plan. A few comments about the clock sweep patch: - It'd be easier to review if BgBufferSync() weren't basically re-indented wholesale. Maybe you could instead move the relevant code to a helper function that's called by BgBufferSync() for each clock? - I think choosing a clock sweep partition in every tick would likely show up in workloads that do a lot of buffer replacement, particularly if buffers in the workload often have a high usagecount (and thus more ticks are used). Given that your balancing approach "sticks" with a partition for a while, could we perhaps only choose the partition after exhausting that budget? - I don't really understand what > + /* > + * Buffers that should have been allocated in this partition (but might > + * have been redirected to keep allocations balanced). > + */ > + pg_atomic_uint32 numRequestedAllocs; > + is intended for. Adding yet another atomic increment for every clock sweep tick seems rather expensive... - I wonder if the balancing budgets being relatively low will be good enough. It's not too hard to imagine that this frequent "partition choosing" will be bad in buffer access heavy workloads. But it's probably the right approach until we've measured it being a problem. - It'd be interesting to do some very simple evaluation like a single pg_prewarm() of a relation that's close to the size of shared buffers and verify that we don't end up evicting newly read in buffers. I think your approach should work, but verifying that... I wonder if we could make some of this into tests somehow. 
It's pretty easy to break this kind of thing and not notice, as everything just continues to work, just a tad slower. Greetings, Andres Freund
On 8/13/25 17:16, Andres Freund wrote: > Hi, > > On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote: >> The patch does a much simpler thing - treat the weight as a "budget", >> i.e. number of buffers to allocate before proceeding to the "next" >> partition. So it allocates 55 buffers from P1, then 45 buffers from P2, >> and then goes back to P1 in a round-robin way. The advantage is it can >> do away without a PRNG. > > I think that's a good plan. > > > A few comments about the clock sweep patch: > > - It'd be easier to review if BgBufferSync() weren't basically re-indented > wholesale. Maybe you could instead move the relevant code to a helper > function that's called by BgBufferSync() for each clock? > True, I'll rework it like that. > - I think choosing a clock sweep partition in every tick would likely show up > in workloads that do a lot of buffer replacement, particularly if buffers > in the workload often have a high usagecount (and thus more ticks are used). > Given that your balancing approach "sticks" with a partition for a while, > could we perhaps only choose the partition after exhausting that budget? > That should be possible, yes. By "exhausting budget" you mean going through all the partitions, right? > - I don't really understand what > >> + /* >> + * Buffers that should have been allocated in this partition (but might >> + * have been redirected to keep allocations balanced). >> + */ >> + pg_atomic_uint32 numRequestedAllocs; >> + > > is intended for. > > Adding yet another atomic increment for every clock sweep tick seems rather > expensive... > For the balancing (to calculate the budgets), we need to know the number of allocation requests for each partition, before some of the requests got redirected to other partitions. We can't use the number of "actual" allocations. But it seems useful to have both - one to calculate the budgets, the other to monitor how balanced the result is. 
I haven't seen the extra atomic in profiles, even on workloads that do a lot of buffer allocations (e.g. seqscan with datasets > shared buffers). But if that happens, I think there are ways to mitigate that. > > - I wonder if the balancing budgets being relatively low will be good > enough. It's not too hard to imagine that this frequent "partition choosing" > will be bad in buffer access heavy workloads. But it's probably the right > approach until we've measured it being a problem. > I don't follow. How would making the budgets higher change any of this? Anyway, I think choosing the partitions less frequently - e.g. only after consuming budget for the current partition, or going "full cycle", would make this a non-issue. > > - It'd be interesting to do some very simple evaluation like a single > pg_prewarm() of a relation that's close to the size of shared buffers and > verify that we don't end up evicting newly read in buffers. I think your > approach should work, but verifying that... > Will try. > I wonder if we could make some of this into tests somehow. It's pretty easy > to break this kind of thing and not notice, as everything just continues to > work, just a tad slower. > Do you mean a test that'd be a part of make check, or a standalone test? AFAICS any meaningful test would need to be fairly expensive, so probably not a good fit for make check. regards -- Tomas Vondra
Hi, Here's a fresh version of the NUMA patch series. There's a number of substantial improvements: 1) Rebase to current master, particularly on top of 2c7894052759 which removed the freelist. The patch that partitioned the freelist is gone. 2) A bunch of fixes, and it now passes CI workflows on github, and all other testing I did. Of course, more testing is needed. 3) It builds with/without libnuma support, and so on. 4) The separate GUCs were replaced by a single list GUC, similar to what we do for debug_io_direct. The GUC is called debug_numa, and accepts "buffers" and "procs", to partition the two areas. I'm considering adding a "clock-sweep" option in a future patch. I didn't do that here, because the buffer partitioning is already enabled by "buffers", and the clock-sweep just builds on that (partitioning the same way, pretty much). 5) It also works with EXEC_BACKEND, but this turned out to be a bit more challenging than expected. The trouble is some of the parameters (e.g. memory page size) are used both in the "size" and "init" phases, and we need to be extra careful to not make something "inconsistent". For example, we may get confused about the memory page size. The "size" happens before allocation, and at that point we don't know if we succeed in getting enough huge pages. When "init" happens, we already know that, so our "memory page size" could be different. We must be careful, e.g. to not need more memory than we requested. This is a general problem, but EXEC_BACKEND makes it a bit trickier. In the regular fork() case we can simply set some global variables in "size" and use them later in "init", but that doesn't work for EXEC_BACKEND. The current approach simply does the calculations twice, in a way that should end with the same results. But I'm not 100% happy with it, and I suspect it just confirms we should store the results in shared memory (which is what I called "NUMA registry" before). That's still a TODO. 6) BufferDescPadded is 64B everywhere. 
Originally, the padding was applied only on 64-bit platforms, and on 32-bit systems the struct was left at 60B. But that is not compatible with buffer partitioning, which relies on the memory page being a multiple of BufferDescPadded. Perhaps it could be relaxed (so that a BufferDesc might span two memory pages), but this seems like the cleanest solution. I don't expect it to make any measurable difference. 7) I've moved some of the code to BgBufferSyncPartition, to make review easier (without all the indentation changes). 8) I've realized some of the TAP tests occasionally fail with ERROR: no unpinned buffers and I think I know why. Some of the tests set shared_buffers to a very low value - like 1MB or even 128kB, and StrategyGetBuffer() may search only a single partition (but not always). We may run out of unpinned buffers in that one partition. This apparently happens more easily on rpi5, due to the weird NUMA layout (there are 8 nodes with memory, but getcpu() reports node 0 for all cores). I suspect the correct fix is to ensure StrategyGetBuffer() scans all partitions, if there are no unpinned buffers in the current one. On realistic setups this shouldn't happen very often, I think. The other issue I just realized is that StrategyGetBuffer() recalculates the partition index over and over, which seems unnecessary (and possibly expensive, due to the modulo). And it also does too many loops, because it used NBuffers instead of the partition size. I'll fix those later. 9) I'm keeping the clock-sweep balancing patches separate for now. In the end all the clock-sweep patches should be merged, but it keeps the changes easier to review this way, I think. 10) There are not many changes in the PGPROC partitioning patch. I merely fixed issues that broke it on EXEC_BACKEND, and did some smaller tweaks. 11) I'm not including the experimental patches to pin backends to CPUs (or nodes), and so on. It's clear those are unlikely to go in, so it'd be just a distraction. 
12) What's the right / portable way to determine the current CPU for a process? The clock-sweep / PGPROC patches need this to pick the right partition, but it's not clear to me which API is the most portable. In the end I used sched_getcpu(), and then numa_node_of_cpu(), but maybe there's a better way. regards -- Tomas Vondra
Attachments
On 9/11/25 10:32, Tomas Vondra wrote: > ... > > For example, we may get confused about the memory page size. The "size" > happens before allocation, and at that point we don't know if we succeed > in getting enough huge pages. When "init" happens, we already know that, > so our "memory page size" could be different. We must be careful, e.g. > to not need more memory than we requested. I forgot to mention the other issue with huge pages on NUMA. I already reported [1] it's trivial to crash with a SIGBUS, because (1) huge pages get reserved on all NUMA nodes (evenly) (2) the decision whether to use huge pages is done by mmap(), which only needs to check if there are enough huge pages in total (3) numa_tonode_memory is called later, and does not verify if the target node has enough free pages (I'm not sure it should / can) (4) we only partition (and locate to NUMA nodes) some of the memory, and the rest (which is much smaller, but still sizeable) is likely causing "imbalance" - it gets placed on one (random) node, and it then does not have enough space for the stuff we explicitly placed there (5) then at some point we try accessing one of the shared buffers, that triggers page fault, tries to get a huge page on the NUMA node, realizes there are no free huge pages, and crashes with SIGBUS It clearly is not an option to just let it crash, but I still don't have a great idea how to address it. The only idea I have is to manually interleave the whole shared memory (when using huge pages), page by page, so that this imbalance does not happen. But it's harder than it looks, because we don't necessarily partition everything evenly. For example, one node can get a smaller chunk of shared buffers, because we try to partition buffers and buffers descriptors in a "nice" way. The PGPROC stuff is also not distributed quite evenly (e.g. aux/2pc entries are not mapped to any node). 
A different approach would be to calculate how many per-node huge pages we'll need (for the stuff we partition explicitly - buffers and PGPROC), and then the rest of the memory that can get placed on any node. And require the "maximum" number of pages that can get placed on any node. But that's annoyingly wasteful, because every other node will end up with unusable memory. regards [1] https://www.postgresql.org/message-id/71a46484-053c-4b81-ba32-ddac050a8b5d%40vondra.me -- Tomas Vondra
On 9/11/25 10:32, Tomas Vondra wrote: > ... > > 8) I've realized some of the TAP tests occasionally fail with > > ERROR: no unpinned buffers > > and I think I know why. Some of the tests set shared_buffers to a very > low value - like 1MB or even 128kB, and StrategyGetBuffer() may search > only a single partition (but not always). We may run out of unpinned > buffers in that one partition. > > This apparently happens more easily on rpi5, due to the weird NUMA > layout (there are 8 nodes with memory, but getcpu() reports node 0 for > all cores). > > I suspect the correct fix is to ensure StrategyGetBuffer() scans all > partitions, if there are no unpinned buffers in the current one. On > realistic setups this shouldn't happen very often, I think. > > The other issue I just realized is that StrategyGetBuffer() recalculates > the partition index over and over, which seems unnecessary (and possibly > expensive, due to the modulo). And it also does too many loops, because > it used NBuffers instead of the partition size. I'll fix those later. Here's a version fixing this issue (in the 0006 part). It modifies StrategyGetBuffer() to walk through all the partitions, in a round-robin manner. The way it steps to the next partition is a bit ugly, but it works and I'll think about some better way. I haven't done anything about the other issue (the one with huge pages reserved on NUMA nodes, and SIGBUS). regards -- Tomas Vondra
Attachments
- v20250918-0001-NUMA-shared-buffers-partitioning.patch
- v20250918-0002-NUMA-clockweep-partitioning.patch
- v20250918-0003-NUMA-clocksweep-allocation-balancing.patch
- v20250918-0004-NUMA-weighted-clocksweep-balancing.patch
- v20250918-0005-NUMA-partition-PGPROC.patch
- v20250918-0006-fixup-StrategyGetBuffer.patch