Re: Adding basic NUMA awareness
From:        Tomas Vondra
Subject:     Re: Adding basic NUMA awareness
Date:
Msg-id:      51e51832-7f47-412a-a1a6-b972101cc8cb@vondra.me
In reply to: Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
List:        pgsql-hackers
Hi,

Here's an updated version of the patch series. The main improvement is
the new 0006 patch, adding "adaptive balancing" of allocations. I'll
also share some results from a workload doing a lot of allocations.


adaptive balancing of allocations
---------------------------------

Imagine each backend only allocates buffers from the partition on the
same NUMA node. E.g. you have 4 NUMA nodes (i.e. 4 partitions), and a
backend only allocates buffers from its "home" partition (on the same
NUMA node). This is what the earlier patch versions did, and with many
backends that's mostly fine (assuming the backends get spread over all
the NUMA nodes).

But if there are only a few backends doing the allocations, this can
result in very inefficient use of shared buffers - a single backend
would be limited to 25% of buffers, even if the rest is unused. There
needs to be some way to "redirect" excess allocations to other
partitions, so that the partitions are utilized about the same. This is
what the 0006 patch aims to do (I kept it separate, but it should
probably get merged into the "clocksweep partitioning" patch in the end).

The balancing is fairly simple:

(1) It tracks the number of allocations "requested" from each partition.

(2) At regular intervals (in bgwriter), calculate the "fair share" per
    partition, and determine what fraction of the "requests" to handle
    from the partition itself, and how many to redirect to other
    partitions.

(3) Calculate coefficients to drive this for each partition.

I emphasize that (1) talks about "requests", not the actual allocations.
Some of the requests could have been redirected to different partitions,
and be counted as allocations there. We want to balance allocations, but
the balancing has to work from the requests.

To give you a simple example - imagine there are 2 partitions with this
number of allocation requests:

    P1: 900,000 requests
    P2: 100,000 requests

This means the "fair share" is 500,000 allocations, so P1 needs to
redirect some requests to P2, and we end up with these weights:

    P1: [ 55,  45]
    P2: [  0, 100]

That is, P1 serves 55% of its requests itself and redirects 45% to P2,
while P2 serves all of its requests locally. Assuming the workload does
not shift in some dramatic way, this should result in both partitions
handling ~500k allocations. It's not hard to extend this algorithm to
more partitions. For more details see StrategySyncBalance(), which
recalculates this.
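To make that a bit more concrete, here's a minimal standalone sketch of
the weight calculation, assuming the excess of an overloaded partition
is simply split in proportion to the other partitions' deficits. This is
not the actual StrategySyncBalance() code - the names
(compute_balance_weights, NUM_PARTITIONS) are made up for the
illustration, and it ignores the hysteresis/averaging and the other
refinements mentioned in the open questions below:

    /*
     * Sketch of the balancing calculation (not the actual patch code).
     * Given the number of allocation requests observed per partition,
     * compute weights[i][j] = percentage of partition i's requests that
     * should be served from partition j, so that allocations end up
     * spread roughly evenly across partitions.
     */
    #include <stdio.h>

    #define NUM_PARTITIONS 2

    static void
    compute_balance_weights(const long requests[NUM_PARTITIONS],
                            int weights[NUM_PARTITIONS][NUM_PARTITIONS])
    {
        long    total = 0;
        double  fair;
        double  deficit_total = 0.0;
        int     i, j;

        for (i = 0; i < NUM_PARTITIONS; i++)
            total += requests[i];

        /* ideal number of allocations each partition should serve */
        fair = (double) total / NUM_PARTITIONS;

        /* total shortfall of partitions that got fewer requests than fair */
        for (i = 0; i < NUM_PARTITIONS; i++)
            if (requests[i] < fair)
                deficit_total += fair - requests[i];

        for (i = 0; i < NUM_PARTITIONS; i++)
        {
            /* default: serve all requests from the partition itself */
            for (j = 0; j < NUM_PARTITIONS; j++)
                weights[i][j] = (i == j) ? 100 : 0;

            /* overloaded partitions redirect the excess to underloaded ones */
            if (requests[i] > fair && deficit_total > 0)
            {
                int     local = (int) (100.0 * fair / requests[i]);
                int     excess = 100 - local;

                weights[i][i] = local;

                /* split the excess in proportion to each partition's deficit */
                for (j = 0; j < NUM_PARTITIONS; j++)
                {
                    if (j == i || requests[j] >= fair)
                        continue;
                    weights[i][j] = (int) (excess * (fair - requests[j]) / deficit_total);
                }
            }
        }
    }

    int
    main(void)
    {
        long    requests[NUM_PARTITIONS] = {900000, 100000};
        int     weights[NUM_PARTITIONS][NUM_PARTITIONS];
        int     i;

        compute_balance_weights(requests, weights);

        /* prints P1: [ 55,  45] and P2: [  0, 100], as in the example above */
        for (i = 0; i < NUM_PARTITIONS; i++)
            printf("P%d: [%3d, %3d]\n", i + 1, weights[i][0], weights[i][1]);

        return 0;
    }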
There are a couple open questions, like:

* The algorithm combines the old/new weights by averaging, to add a bit
  of hysteresis. Right now it's a simple average with 0.5 weight, to
  dampen sudden changes. I think it works fine (in the long run), but
  I'm open to suggestions on how to do this better.

* There are probably additional things we should consider when deciding
  where to redirect the allocations. For example, we may have multiple
  partitions per NUMA node, in which case it's better to redirect as
  many allocations as possible to partitions on the same node. The
  current patch ignores this.

* The partitions may have slightly different sizes, but the balancing
  ignores that for now. This is not very difficult to address.


clocksweep benchmark
--------------------

I ran a simple benchmark focused on allocation-heavy workloads, namely
large concurrent sequential scans. The attached scripts generate a
number of 1GB tables, and then run concurrent sequential scans with
shared buffers set to 60%, 75%, 90% and 110% of the total dataset size.
I did this for master, and with the NUMA patches applied (and the GUCs
set to 'on'). I also tried with the number of partitions increased to
16 (so a NUMA node got multiple partitions).

There are results from three machines:

1) ryzen - small non-NUMA system, mostly to see if there are regressions

2) xeon - older 2-node NUMA system

3) hb176 - big EPYC system with 176 cores / 4 NUMA nodes

The script records detailed TPS stats (e.g. percentiles). I'm attaching
CSV files with complete results, and some PDFs with charts summarizing
that (I'll get to that in a minute).

For the EPYC, the average tps for the three builds looks like this:

     clients | master   numa   numa-16 |   numa   numa-16
    ---------|--------------------------|-------------------
           8 |     20     27        26 |   133%      129%
          16 |     23     39        45 |   170%      193%
          24 |     23     48        58 |   211%      252%
          32 |     21     57        68 |   268%      321%
          40 |     21     56        76 |   265%      363%
          48 |     22     59        82 |   270%      375%
          56 |     22     66        88 |   296%      397%
          64 |     23     62        93 |   277%      411%
          72 |     24     68        95 |   277%      389%
          80 |     24     72        95 |   295%      391%
          88 |     25     71        98 |   283%      392%
          96 |     26     74        97 |   282%      369%
         104 |     26     74        97 |   282%      367%
         112 |     27     77        95 |   287%      355%
         120 |     28     77        92 |   279%      335%
         128 |     27     75        89 |   277%      328%

That's not bad - the clocksweep partitioning increases the throughput
2-3x. Having 16 partitions (instead of 4) helps a bit more, to 3-4x.
This is for shared buffers set to 60% of the dataset, which depends on
the number of clients / tables. With 64 clients/tables, there's 64GB of
data, and shared buffers are set to ~39GB. The results for 75% and 90%
follow the same pattern. For 110% there's much less impact - there are
no allocations, so this has to be thanks to the other NUMA patches.

The charts in the attached PDFs add a bit more detail, with various
percentiles (of per-second throughput). The bands are roughly quartiles:
5-25%, 25-50%, 50-75%, 75-95%. The thick middle line is the median.
There are only charts for 60%, 90% and 110% shared buffers, to fit it
all on a single page. The 75% results are not very different.

For ryzen there's little difference. Not surprising, it's not a NUMA
system. So this is a positive result, as there's no regression. For
xeon the patches help a little bit. Again, not surprising. It's a
fairly old system (~2016), and the differences between NUMA nodes are
not that significant. For epyc (hb176), the differences are pretty
massive.

regards

--
Tomas Vondra
Attachments:
- numa-benchmark-ryzen.pdf
- numa-benchmark-epyc.pdf
- numa-benchmark-xeon.pdf
- xeon.csv.gz
- ryzen.csv.gz
- hb176.csv.gz
- run.sh
- generate.sh
- v20250804-0001-NUMA-interleaving-buffers.patch.gz
- v20250804-0008-NUMA-pin-backends-to-NUMA-nodes.patch.gz
- v20250804-0007-NUMA-interleave-PGPROC-entries.patch.gz
- v20250804-0006-NUMA-clocksweep-allocation-balancing.patch.gz
- v20250804-0005-NUMA-clockweep-partitioning.patch.gz
- v20250804-0004-NUMA-partition-buffer-freelist.patch.gz
- v20250804-0003-freelist-Don-t-track-tail-of-a-freelist.patch.gz
- v20250804-0002-NUMA-localalloc.patch.gz