Обсуждение: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Salvatore Dipietro
Дата:
Hi, we would like to propose the removal of the Instruction Synchronization Barrier (isb) for aarch64 architectures. Based on our testing on Graviton instances (m7g.16xlarge), we can see on average over multiple iterations up to 12% better performance using PGBench select-only and up to 9% with Sysbench oltp_read_only workloads. On Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench select-only and up to 6% with Sysbench oltp_read_only workloads. We have also tested it putting more pressure on the spin_delay function, enabling pg_stat_statements.track_planning with PGBench read-only [0] and, on average, the patch shows up to 27% better performance on m6g.16xlarge and up to 37% on m7g.16xlarge. Testing environment: - PostgreSQL version: 17.2 - Operating System: Ubuntu 22.04 - Test Platform: AWS Graviton instances (m6g.16xlarge, m7g.16xlarge and m8g.24xlarge) Our benchmark results on PGBench select-only without pg_stat_statements.track_planning: ``` # Load DB on m7g.16xlarge $ pgbench -i --fillfactor=90 --scale=5644 --host=172.31.32.85 --username=postgres pgtest # Without patch $ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared -P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest ... "transaction type: <builtin: select only>", "scaling factor: 5644", "query mode: prepared", "number of clients: 256", "number of threads: 96", "duration: 600 s", "number of transactions actually processed: 359864937", "latency average = 0.420 ms", "latency stddev = 1.755 ms", "tps = 599770.727316 (including connections establishing)", "tps = 599826.788919 (excluding connections establishing)" # With patch $ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared -P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest ... "transaction type: <builtin: select only>", "scaling factor: 5644", "query mode: prepared", "number of clients: 256", "number of threads: 96", "duration: 600 s", "number of transactions actually processed: 405891881", "latency average = 0.371 ms", "latency stddev = 0.569 ms", "tps = 676480.900049 (including connections establishing)", "tps = 676523.557293 (excluding connections establishing)" ``` [0] https://www.postgresql.org/message-id/ZxgDEb_VpWyNZKB_%40nathan
Вложения
On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro <dipietro.salvatore@gmail.com> wrote: > we would like to propose the removal of the Instruction > Synchronization Barrier (isb) for aarch64 architectures. Based on our > testing on Graviton instances (m7g.16xlarge), we can see on average > over multiple iterations up to 12% better performance using PGBench > select-only and up to 9% with Sysbench oltp_read_only workloads. On > Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench > select-only and up to 6% with Sysbench oltp_read_only workloads. > We have also tested it putting more pressure on the spin_delay > function, enabling pg_stat_statements.track_planning with PGBench > read-only [0] and, on average, the patch shows up to 27% better > performance on m6g.16xlarge and up to 37% on m7g.16xlarge. Hmm. This was added only 3 years ago, supposedly because it made performance better: commit a82a5eee314df52f3183cedc0ecbcac7369243b1 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed Apr 6 18:57:57 2022 -0400 Use ISB as a spin-delay instruction on ARM64. This seems beneficial on high-core-count machines, and not harmful on lesser hardware. However, older ARM32 gear doesn't have this instruction, so restrict the patch to ARM64. Geoffrey Blake Discussion: https://postgr.es/m/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com I think you should make some kind of argument about why the previous conclusion was wrong, or why something's changed between then and now. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro > <dipietro.salvatore@gmail.com> wrote: >> we would like to propose the removal of the Instruction >> Synchronization Barrier (isb) for aarch64 architectures. Based on our >> testing on Graviton instances (m7g.16xlarge), we can see on average >> over multiple iterations up to 12% better performance using PGBench >> select-only and up to 9% with Sysbench oltp_read_only workloads. On >> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench >> select-only and up to 6% with Sysbench oltp_read_only workloads. >> We have also tested it putting more pressure on the spin_delay >> function, enabling pg_stat_statements.track_planning with PGBench >> read-only [0] and, on average, the patch shows up to 27% better >> performance on m6g.16xlarge and up to 37% on m7g.16xlarge. > I think you should make some kind of argument about why the previous > conclusion was wrong, or why something's changed between then and now. TBH, those numbers are large enough that I flat out don't believe them. As noted in the previous thread, we've managed to squeeze out a lot of our dependencies on spinlock performance, via algorithmic changes, migration to atomic ops, etc. So I think ten-percent-ish improvement on a plain pgbench test just isn't very plausible. We certainly didn't see that kind of effect when we were doing that earlier round of testing --- we had to use a custom testing lashup to get numbers that were outside the noise at all. Of course, microbenchmarking is a tricky business, so it's possible that a different custom testing lashup would show the opposite results. But what's quoted above is sufficiently unlike our prior results that I can't help thinking something is wrong. One other thing that comes to mind is that pg_stat_statements has stretched the intention of "short straight-line code segment" to the point of unrecognizability. Maybe it needs to stop using spinlocks to protect pgssEntry updates. But I'm not sure if that would move the needle on whether ISB is a good idea or not. regards, tom lane
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Thu, May 01, 2025 at 02:48:59PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro >> <dipietro.salvatore@gmail.com> wrote: >>> we would like to propose the removal of the Instruction >>> Synchronization Barrier (isb) for aarch64 architectures. Based on our >>> testing on Graviton instances (m7g.16xlarge), we can see on average >>> over multiple iterations up to 12% better performance using PGBench >>> select-only and up to 9% with Sysbench oltp_read_only workloads. On >>> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench >>> select-only and up to 6% with Sysbench oltp_read_only workloads. >>> We have also tested it putting more pressure on the spin_delay >>> function, enabling pg_stat_statements.track_planning with PGBench >>> read-only [0] and, on average, the patch shows up to 27% better >>> performance on m6g.16xlarge and up to 37% on m7g.16xlarge. > >> I think you should make some kind of argument about why the previous >> conclusion was wrong, or why something's changed between then and now. > > TBH, those numbers are large enough that I flat out don't believe > them. As noted in the previous thread, we've managed to squeeze out > a lot of our dependencies on spinlock performance, via algorithmic > changes, migration to atomic ops, etc. So I think ten-percent-ish > improvement on a plain pgbench test just isn't very plausible. > We certainly didn't see that kind of effect when we were doing that > earlier round of testing --- we had to use a custom testing lashup > to get numbers that were outside the noise at all. I'm not sure there's actually any hot spinlock involved in a regular select-only pgbench (unless perhaps pg_stat_statements was enabled or something). But I do know pg_stat_statements.track_planning stresses the spinlock code quite a bit, and commit 3d0b4b1 recently added a non-locking initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still appropriate. It'd be interesting to see the performance difference of removing the ISB with and without commit 3d0b4b1 applied. -- nathan
Nathan Bossart <nathandbossart@gmail.com> writes: > ... commit 3d0b4b1 recently added a non-locking > initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still > appropriate. It'd be interesting to see the performance difference of > removing the ISB with and without commit 3d0b4b1 applied. Oh! That's an excellent point. The OP didn't mention if their tests were done before or after 3d0b4b1, but that might well matter. I still think pgbench is a very blunt tool for this type of testing, though. I recommend resurrecting the test_shm_mq-based hack discussed in the prior thread and seeing what that shows. regards, tom lane
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Thu, May 01, 2025 at 04:08:06PM -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> ... commit 3d0b4b1 recently added a non-locking >> initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still >> appropriate. It'd be interesting to see the performance difference of >> removing the ISB with and without commit 3d0b4b1 applied. > > Oh! That's an excellent point. The OP didn't mention if their tests > were done before or after 3d0b4b1, but that might well matter. > > I still think pgbench is a very blunt tool for this type of testing, > though. I recommend resurrecting the test_shm_mq-based hack discussed > in the prior thread and seeing what that shows. Well, I have interesting results. This is all on a c8g.24xlarge (96 cores, Neoverse-V2, Armv9.0-a). For the first test_shm_mq test, I ran the following: SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1); SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 2); SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4); ... This gave me the following results (values are in seconds): w/o 3d0b4b1 w/ 3d0b4b1 ISB no ISB ISB no ISB 1 1.4 1.6 1.5 1.6 2 2.1 2.0 2.1 2.1 4 3.2 3.5 3.3 3.5 8 7.4 8.1 7.2 8.4 16 18.0 35.9 22.7 23.4 32 35.7 85.6 53.7 49.5 64 85.1 ? 147.6 100.1 For the second test_shm_mq test, I ran at higher concurrency, so I had to reduce the loop counts: SELECT test_shm_mq_pipelined(16384, 'xyzzy', 100000, 32); ... That gave me the following: w/o 3d0b4b1 w/ 3d0b4b1 ISB no ISB ISB no ISB 32 0.4 0.8 0.5 0.6 64 2.0 4.8 1.3 1.1 128 6.1 29.3 7.5 2.1 256 43.0 66.4 24.4 4.5 Finally, I ran the pgbench select-only test with pg_stat_statements.track_planning enabled (values are in thousands of transactions per second): w/o 3d0b4b1 w/ 3d0b4b1 ISB no ISB ISB no ISB 71.4 67.4 538.2 891.2 So... * The ISB does seem to have a positive effect without commit 3d0b4b1 applied. * With commit 3d0b4b1 applied, removing the ISB seems to have a positive effect at high concurrencies. This is especially pronounced in the pgbench test. * With commit 3d0b4b1 applied, removing the ISB doesn't change much at lower concurrencies, and there might even be a small regression. * At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress some test_shm_mq tests. Removing the ISB instruction appears to help in some cases, but not all. -- nathan
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Salvatore Dipietro
Дата:
On Thu, 1 May 2025 at 13:08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Oh! That's an excellent point. The OP didn't mention if their tests > were done before or after 3d0b4b1, but that might well matter. The benchmarks we conducted are based on REL_17_2 branch which do not include the TAS_SPIN(lock) change for ARM yet.
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Salvatore Dipietro
Дата:
On Thu, 1 May 2025 at 14:50, Nathan Bossart <nathandbossart@gmail.com> wrote: > So... > > * The ISB does seem to have a positive effect without commit 3d0b4b1 > applied. > > * With commit 3d0b4b1 applied, removing the ISB seems to have a positive > effect at high concurrencies. This is especially pronounced in the > pgbench test. > > * With commit 3d0b4b1 applied, removing the ISB doesn't change much at > lower concurrencies, and there might even be a small regression. > > * At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress > some test_shm_mq tests. Removing the ISB instruction appears to help in > some cases, but not all. Based on your findings Nathan, what is the best way to proceed for this change? Do we need more validation for it? If yes, which kind? Thanks, Salvatore
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Tue, May 13, 2025 at 11:37:45AM -0700, Salvatore Dipietro wrote: > On Thu, 1 May 2025 at 14:50, Nathan Bossart <nathandbossart@gmail.com> wrote: >> So... >> >> * The ISB does seem to have a positive effect without commit 3d0b4b1 >> applied. >> >> * With commit 3d0b4b1 applied, removing the ISB seems to have a positive >> effect at high concurrencies. This is especially pronounced in the >> pgbench test. >> >> * With commit 3d0b4b1 applied, removing the ISB doesn't change much at >> lower concurrencies, and there might even be a small regression. >> >> * At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress >> some test_shm_mq tests. Removing the ISB instruction appears to help in >> some cases, but not all. > > Based on your findings Nathan, what is the best way to proceed for this change? > Do we need more validation for it? If yes, which kind? Well, I am confused because your recent message [0] indicated that you saw improvement without the non-locking initial test in TAS_SPIN, which seems to contradict my findings [1]. Could you retry your tests on v18devel? It might also be useful to repeat the tests on a variety of hardware to ensure it's a win across the board. [0] https://postgr.es/m/CAGnuAhXNQXCcS1nCeD6E0Dyfi4Ms-b0sjcm79Y9iMi5WxUqM4g%40mail.gmail.com [1] https://postgr.es/m/aBPsrFbjnrqp3_8S%40nathan -- nathan
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Mon, May 19, 2025 at 11:38:39AM -0500, Nathan Bossart wrote: > On Tue, May 13, 2025 at 11:37:45AM -0700, Salvatore Dipietro wrote: >> Based on your findings Nathan, what is the best way to proceed for this change? >> Do we need more validation for it? If yes, which kind? > > Well, I am confused because your recent message [0] indicated that you saw > improvement without the non-locking initial test in TAS_SPIN, which seems > to contradict my findings [1]. Could you retry your tests on v18devel? It > might also be useful to repeat the tests on a variety of hardware to ensure > it's a win across the board. I should probably also mention that we are in feature-freeze until v19 development opens in July. -- nathan
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Salvatore Dipietro
Дата:
On Mon, 19 May 2025 at 09:38, Nathan Bossart <nathandbossart@gmail.com> wrote: > Could you retry your tests on v18devel? It might also be useful to repeat the tests on a variety of hardware to ensure > it's a win across the board. Hi Nathan, Thanks for your clarification. As you requested, I have performed more tests on different instance types and sizes. In particular, I have run the `test_shm_mq_pipelined` benchmark using Ubuntu 22.04 on m7g.[2,8,16]xlarge and c8g.[2,8.24]xlarge instances with PG master branch (commit: 84914e964b4). Each test has been repeated 30 times and here is the average (in seconds) and the difference from baseline (Master). Graviton3 instances (m7g) results: | Concurrency | Loops | 2xl Master | 2xl No-ISB | 8xl Master | 8xl No-ISB | 16xl Master | 16xl No-ISB | |-------------|----------|------------|--------------|------------|--------------|-------------|----------------| | 1 | 10000000 | 1.9s | 1.9s (1.00x) | 1.9s | 1.8s (1.06x) | 1.7s | 1.8s (0.94x) | | 2 | 10000000 | 2.4s | 2.4s (1.00x) | 2.5s | 2.3s (1.09x) | 2.3s | 2.3s (1.00x) | | 4 | 10000000 | 3.8s | 3.8s (1.00x) | 3.8s | 3.6s (1.06x) | 3.5s | 3.8s (0.92x) | | 8 | 10000000 | 8.9s | 10.3s (0.86x)| 7.5s | 8.6s (0.87x) | 7.8s | 8.9s (0.88x) | | 16 | 10000000 | 21.6s | 22.5s (0.96x)| 22.5s | 23.6s (0.95x)| 21.4s | 24.9s (0.86x) | | 32 | 10000000 | 42.8s | 41.3s (1.04x)| 114.7s | 52.0s (2.21x)| 88.6s | 49.9s (1.78x) | | 64 | 10000000 | 81.8s | 73.3s (1.12x)| 395.9s | 85.2s (4.65x)| 381.3s | 97.0s (3.93x) | | 32 | 100000 | 0.4s | 0.4s (1.00x) | 1.1s | 0.5s (2.20x) | 1.1s | 0.6s (1.83x) | | 64 | 100000 | 0.8s | 0.8s (1.00x) | 3.9s | 0.9s (4.33x) | 3.9s | 1.1s (3.55x) | | 128 | 100000 | 1.6s | 1.5s (1.07x) | 8.5s | 1.9s (4.47x) | 13.3s | 2.0s (6.65x) | | 256 | 100000 | 3.2s | 3.1s (1.03x) | 19.8s | 4.0s (4.95x) | 35.9s | 4.1s (8.76x) | Graviton4 instances (c8g) results: | Concurrency | Loops | 2xl Master | 2xl No-ISB | 8xl Master | 8xl No-ISB | 24xl Master | 24xl No-ISB | |-------------|----------|------------|---------------|------------|---------------|-------------|----------------| | 1 | 10000000 | 1.7s | 1.6s (1.06x) | 1.6s | 1.6s (1.00x) | 1.6s | 1.5s (1.07x) | | 2 | 10000000 | 2.2s | 2.2s (1.00x) | 2.2s | 2.2s (1.00x) | 2.2s | 2.1s (1.05x) | | 4 | 10000000 | 3.4s | 3.5s (0.97x) | 3.5s | 3.4s (1.03x) | 3.5s | 3.4s (1.03x) | | 8 | 10000000 | 10.9s | 13.9s (0.78x) | 8.2s | 9.4s (0.87x) | 7.8s | 8.2s (0.95x) | | 16 | 10000000 | 23.6s | 27.0s (0.87x) | 26.3s | 26.1s (1.01x) | 27.1s | 28.1s (0.96x) | | 32 | 10000000 | 44.6s | 46.9s (0.95x) | 60.6s | 47.7s (1.27x) | 62.1s | 50.4s (1.23x) | | 64 | 10000000 | 81.4s | 81.5s (1.00x) | 189.4s | 91.5s (2.07x) | 176.9s | 101.3s (1.75x) | | 32 | 100000 | 0.5s | 0.5s (1.00x) | 0.6s | 0.5s (1.20x) | 0.6s | 0.5s (1.20x) | | 64 | 100000 | 0.8s | 0.8s (1.00x) | 1.7s | 0.9s (1.89x) | 2.1s | 1.2s (1.75x) | | 128 | 100000 | 1.5s | 1.6s (0.94x) | 4.5s | 1.9s (2.37x) | 7.8s | 2.1s (3.71x) | | 256 | 100000 | 3.3s | 3.1s (1.06x) | 9.7s | 4.1s (2.37x) | 22.0s | 4.5s (4.89x) | We can notice that with low concurrency (1,2,4) results are similar while with medium concurrency (8,16) the No-ISB approach can introduce some regression especially on smaller instances. However, we can see some significant positive performance impact with high concurrency (>=32) settings on large instances (up to 8.76x on m7g.16xl with 256 concurrency).
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote: > We can notice that with low concurrency (1,2,4) results are similar > while with medium concurrency (8,16) > the No-ISB approach can introduce some regression especially on > smaller instances. However, we can see some significant > positive performance impact with high concurrency (>=32) settings on > large instances (up to 8.76x on m7g.16xl with 256 concurrency). Given these mixed results, it's unclear to me how exactly we should proceed. Perhaps there is another approach that reduces the regressions to a negligible level while still producing gains at higher levels of concurrency. Or maybe we can convince ourselves that these regressions aren't worth worrying about, but that seems like a bit of a stretch to me. -- nathan
Nathan Bossart <nathandbossart@gmail.com> writes: > On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote: >> We can notice that with low concurrency (1,2,4) results are similar >> while with medium concurrency (8,16) >> the No-ISB approach can introduce some regression especially on >> smaller instances. However, we can see some significant >> positive performance impact with high concurrency (>=32) settings on >> large instances (up to 8.76x on m7g.16xl with 256 concurrency). > Given these mixed results, it's unclear to me how exactly we should > proceed. Perhaps there is another approach that reduces the regressions to > a negligible level while still producing gains at higher levels of > concurrency. Or maybe we can convince ourselves that these regressions > aren't worth worrying about, but that seems like a bit of a stretch to me. I agree; I'm also worried that this may be optimizing for one particular ARM implementation and have negative effects on others. Perhaps a better way forward is to identify which spinlock is suffering such huge contention and try to alleviate that at a higher design level. (That could benefit more than just ARM, too.) regards, tom lane
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Álvaro Herrera
Дата:
On 2025-May-01, Tom Lane wrote: > One other thing that comes to mind is that pg_stat_statements > has stretched the intention of "short straight-line code segment" > to the point of unrecognizability. Maybe it needs to stop using > spinlocks to protect pgssEntry updates. But I'm not sure if that > would move the needle on whether ISB is a good idea or not. Yeah, it looks like pgss_store() is being too generous on the amount of code run with that spinlock held. However, changing that spinlock to an lwlock doesn't look easy, because of the way each pgss entry is created as a dynahash entry, and then deallocated from there. With spinlocks we can just reinit the spinlock each time, but that doesn't work with lwlocks. We have no easy way to associate then disassociate each entry from a specific lwlock. Maybe we'd need to do RequestNamedLWLockTranche("pg_stat_statements", 1 + pgss_max) at startup, then put them all in an lwlock-freelist: a list of unused lwlocks in this tranche, to take one on entry_alloc and put it back on entry_dealloc. I think Salvatore's comment that enabling track_planning makes the performance gain more pronounced is evidence that changing this spinlock to lwlock would be beneficial. Another benefit of using an lwlock: pg_stat_statements_internal() could use shared mode to read each entry. Probably not as critical (because it requires the view to be read), but it'd reduce the performance hit when the user is reading the view. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Thu, Aug 14, 2025 at 11:29:08AM +0200, Álvaro Herrera wrote: > On 2025-May-01, Tom Lane wrote: >> One other thing that comes to mind is that pg_stat_statements >> has stretched the intention of "short straight-line code segment" >> to the point of unrecognizability. Maybe it needs to stop using >> spinlocks to protect pgssEntry updates. But I'm not sure if that >> would move the needle on whether ISB is a good idea or not. > > Yeah, it looks like pgss_store() is being too generous on the amount of > code run with that spinlock held. However, changing that spinlock to an > lwlock doesn't look easy, because of the way each pgss entry is created > as a dynahash entry, and then deallocated from there. With spinlocks we > can just reinit the spinlock each time, but that doesn't work with > lwlocks. We have no easy way to associate then disassociate each entry > from a specific lwlock. I've attempted multiple times now to replace this spinlock with LWLocks or atomics, but thus far haven't produced anything that improved matters [0]. IIRC I just dynamically created LWLocks with LWLockInitialize() as needed, which is tedious but easy enough. My recollection is that the use of floats for many of the values greatly complicated the atomics attempts. So, fixing it might require some larger changes. [0] https://postgr.es/m/Zs4hJ6-Fg8DMgU_P%40nathan -- nathan
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Andres Freund
Дата:
Hi, On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote: > On 2025-May-01, Tom Lane wrote: > > > One other thing that comes to mind is that pg_stat_statements > > has stretched the intention of "short straight-line code segment" > > to the point of unrecognizability. Maybe it needs to stop using > > spinlocks to protect pgssEntry updates. But I'm not sure if that > > would move the needle on whether ISB is a good idea or not. > > Yeah, it looks like pgss_store() is being too generous on the amount of > code run with that spinlock held. Indeed. To the point that these days pgss is basically unusable for workloads with many short queries. > However, changing that spinlock to an lwlock doesn't look easy, because of > the way each pgss entry is created as a dynahash entry, and then deallocated > from there. With spinlocks we can just reinit the spinlock each time, but > that doesn't work with lwlocks. We have no easy way to associate then > disassociate each entry from a specific lwlock. I'm not following? The lwlock can just be inside the struct, just like the spinlock is? "Association" is just LWLockInitialize() and deassociation is not needed. Greetings, Andres Freund
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Fri, Aug 15, 2025 at 01:39:52PM -0400, Andres Freund wrote: > On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote: >> However, changing that spinlock to an lwlock doesn't look easy, because of >> the way each pgss entry is created as a dynahash entry, and then deallocated >> from there. With spinlocks we can just reinit the spinlock each time, but >> that doesn't work with lwlocks. We have no easy way to associate then >> disassociate each entry from a specific lwlock. > > I'm not following? The lwlock can just be inside the struct, just like the > spinlock is? "Association" is just LWLockInitialize() and deassociation is not > needed. Indeed. I rebased an old patch that I had lying around to demonstrate. If my past testing [0] is to be trusted, this actually hurts performance, unfortunately. [0] https://postgr.es/m/Zs4hJ6-Fg8DMgU_P@nathan -- nathan
Вложения
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Andres Freund
Дата:
Hi, On 2025-08-15 12:57:52 -0500, Nathan Bossart wrote: > On Fri, Aug 15, 2025 at 01:39:52PM -0400, Andres Freund wrote: > > On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote: > >> However, changing that spinlock to an lwlock doesn't look easy, because of > >> the way each pgss entry is created as a dynahash entry, and then deallocated > >> from there. With spinlocks we can just reinit the spinlock each time, but > >> that doesn't work with lwlocks. We have no easy way to associate then > >> disassociate each entry from a specific lwlock. > > > > I'm not following? The lwlock can just be inside the struct, just like the > > spinlock is? "Association" is just LWLockInitialize() and deassociation is not > > needed. > > Indeed. I rebased an old patch that I had lying around to demonstrate. If > my past testing [0] is to be trusted, this actually hurts performance, > unfortunately. FWIW, rather interesting result of testing the patch briefly: On my older workstation, the patch is a substantial *gain* when there's a lot of contention. But on my newer workstation it's a *loss*. The penalty from enabling pg_stat_statements for readonly pgbench on the newer workstation is rather bad - about 1/3 the throughput. I think the main reason that lwlocks loose on the newer machine is that we loose spinning. The newer machine has more cores and more numa domains and the fairer locks lead to more cacheline pingpong... IMO, the only way to actually make pg_stat_statements scale is to move to a model much more like our regular stats. I.e. accumulate counters in backend local memory and only occasionally update the shared stats. Even if you were to move pgss successfully to atomics, the cacheline contention still would be terrible for performance. FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns* compared to using the spinlocks. You'd replace one atomic operation with dozens, to update all those fields individually. With loads of cacheline pingpong inbetween. Greetings, Andres Freund
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote: > IMO, the only way to actually make pg_stat_statements scale is to move to a > model much more like our regular stats. I.e. accumulate counters in backend > local memory and only occasionally update the shared stats. Agreed. I remember discussing something similar at pgconf.dev this year. > FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns* > compared to using the spinlocks. You'd replace one atomic operation with > dozens, to update all those fields individually. With loads of cacheline > pingpong inbetween. Right. At some point I tried moving most things to atomics and leaving the rest behind the spinlock, and IIRC there wasn't really any improvement. I didn't dig into whether that was because of atomic-related cache line ping-pong or the existing spinlock, but regardless, I quickly discarded that idea. -- nathan
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Fri, Aug 15, 2025 at 03:25:20PM -0500, Nathan Bossart wrote: > On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote: >> IMO, the only way to actually make pg_stat_statements scale is to move to a >> model much more like our regular stats. I.e. accumulate counters in backend >> local memory and only occasionally update the shared stats. > > Agreed. I remember discussing something similar at pgconf.dev this year. BTW I'm planning to give this a try... -- nathan
Andres Freund <andres@anarazel.de> writes: > FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns* > compared to using the spinlocks. You'd replace one atomic operation with > dozens, to update all those fields individually. With loads of cacheline > pingpong inbetween. Not only that, but you'd no longer have any semblance of read-consistency between the fields of a stats entry. regards, tom lane
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Michael Paquier
Дата:
On Fri, Aug 15, 2025 at 03:54:29PM -0500, Nathan Bossart wrote: > On Fri, Aug 15, 2025 at 03:25:20PM -0500, Nathan Bossart wrote: >> On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote: >>> IMO, the only way to actually make pg_stat_statements scale is to move to a >>> model much more like our regular stats. I.e. accumulate counters in backend >>> local memory and only occasionally update the shared stats. >> >> Agreed. I remember discussing something similar at pgconf.dev this year. > > BTW I'm planning to give this a try... This means to use the pluggable pgstats API to achieve that. FWIW, one difficulty I am foreseeing is how we want to cap and control the number of PGSS entries that are saved in the pgstats hash table, based on the current max PGSS can be set to. Just saying that we'll most likely rely on a separate zone of shared memory to track this information. Most of that should happen in the flush callback, based on the current infra in place, I guess. More stats are good to have, still capping them is key to keep the backend under control, and it does not make sense to me to use a fixed chunk of shared memory for this side (meaning a fixed-sized pgstats kind for PGSS). Another benefit is to make the max reloadable. -- Michael
Вложения
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
От
Nathan Bossart
Дата:
On Mon, Aug 11, 2025 at 04:12:48PM -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote: >>> We can notice that with low concurrency (1,2,4) results are similar >>> while with medium concurrency (8,16) >>> the No-ISB approach can introduce some regression especially on >>> smaller instances. However, we can see some significant >>> positive performance impact with high concurrency (>=32) settings on >>> large instances (up to 8.76x on m7g.16xl with 256 concurrency). > >> Given these mixed results, it's unclear to me how exactly we should >> proceed. Perhaps there is another approach that reduces the regressions to >> a negligible level while still producing gains at higher levels of >> concurrency. Or maybe we can convince ourselves that these regressions >> aren't worth worrying about, but that seems like a bit of a stretch to me. > > I agree; I'm also worried that this may be optimizing for one > particular ARM implementation and have negative effects on others. Based on this discussion, I am marking the patch as Returned with Feedback. -- nathan