Обсуждение: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Поиск
Список
Период
Сортировка

Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Salvatore Dipietro
Дата:
Hi,
we would like to propose the removal of the Instruction
Synchronization Barrier (isb) for aarch64 architectures. Based on our
testing on Graviton instances (m7g.16xlarge), we can see on average
over multiple iterations up to 12% better performance using PGBench
select-only and up to 9% with Sysbench oltp_read_only workloads. On
Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
select-only and up to 6% with Sysbench oltp_read_only workloads.
We have also tested it putting more pressure on the spin_delay
function, enabling pg_stat_statements.track_planning with PGBench
read-only [0] and, on average, the patch shows up to 27% better
performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

Testing environment:
- PostgreSQL version: 17.2
- Operating System: Ubuntu 22.04
- Test Platform: AWS Graviton instances (m6g.16xlarge, m7g.16xlarge
and m8g.24xlarge)

Our benchmark results on PGBench select-only without
pg_stat_statements.track_planning:
```
# Load DB on m7g.16xlarge
$ pgbench -i --fillfactor=90 --scale=5644 --host=172.31.32.85
--username=postgres pgtest

# Without patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared
-P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
    "transaction type: <builtin: select only>",
    "scaling factor: 5644",
    "query mode: prepared",
    "number of clients: 256",
    "number of threads: 96",
    "duration: 600 s",
    "number of transactions actually processed: 359864937",
    "latency average = 0.420 ms",
    "latency stddev = 1.755 ms",
    "tps = 599770.727316 (including connections establishing)",
    "tps = 599826.788919 (excluding connections establishing)"


# With patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared
-P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
    "transaction type: <builtin: select only>",
    "scaling factor: 5644",
    "query mode: prepared",
    "number of clients: 256",
    "number of threads: 96",
    "duration: 600 s",
    "number of transactions actually processed: 405891881",
    "latency average = 0.371 ms",
    "latency stddev = 0.569 ms",
    "tps = 676480.900049 (including connections establishing)",
    "tps = 676523.557293 (excluding connections establishing)"
```

[0] https://www.postgresql.org/message-id/ZxgDEb_VpWyNZKB_%40nathan

Вложения
On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
<dipietro.salvatore@gmail.com> wrote:
> we would like to propose the removal of the Instruction
> Synchronization Barrier (isb) for aarch64 architectures. Based on our
> testing on Graviton instances (m7g.16xlarge), we can see on average
> over multiple iterations up to 12% better performance using PGBench
> select-only and up to 9% with Sysbench oltp_read_only workloads. On
> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
> select-only and up to 6% with Sysbench oltp_read_only workloads.
> We have also tested it putting more pressure on the spin_delay
> function, enabling pg_stat_statements.track_planning with PGBench
> read-only [0] and, on average, the patch shows up to 27% better
> performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

Hmm. This was added only 3 years ago, supposedly because it made
performance better:

commit a82a5eee314df52f3183cedc0ecbcac7369243b1
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Apr 6 18:57:57 2022 -0400

    Use ISB as a spin-delay instruction on ARM64.

    This seems beneficial on high-core-count machines, and not harmful
    on lesser hardware.  However, older ARM32 gear doesn't have this
    instruction, so restrict the patch to ARM64.

    Geoffrey Blake

    Discussion:
https://postgr.es/m/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com

I think you should make some kind of argument about why the previous
conclusion was wrong, or why something's changed between then and now.

--
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
> <dipietro.salvatore@gmail.com> wrote:
>> we would like to propose the removal of the Instruction
>> Synchronization Barrier (isb) for aarch64 architectures. Based on our
>> testing on Graviton instances (m7g.16xlarge), we can see on average
>> over multiple iterations up to 12% better performance using PGBench
>> select-only and up to 9% with Sysbench oltp_read_only workloads. On
>> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
>> select-only and up to 6% with Sysbench oltp_read_only workloads.
>> We have also tested it putting more pressure on the spin_delay
>> function, enabling pg_stat_statements.track_planning with PGBench
>> read-only [0] and, on average, the patch shows up to 27% better
>> performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

> I think you should make some kind of argument about why the previous
> conclusion was wrong, or why something's changed between then and now.

TBH, those numbers are large enough that I flat out don't believe
them.  As noted in the previous thread, we've managed to squeeze out
a lot of our dependencies on spinlock performance, via algorithmic
changes, migration to atomic ops, etc.  So I think ten-percent-ish
improvement on a plain pgbench test just isn't very plausible.
We certainly didn't see that kind of effect when we were doing that
earlier round of testing --- we had to use a custom testing lashup
to get numbers that were outside the noise at all.

Of course, microbenchmarking is a tricky business, so it's possible
that a different custom testing lashup would show the opposite
results.  But what's quoted above is sufficiently unlike our prior
results that I can't help thinking something is wrong.

One other thing that comes to mind is that pg_stat_statements
has stretched the intention of "short straight-line code segment"
to the point of unrecognizability.  Maybe it needs to stop using
spinlocks to protect pgssEntry updates.  But I'm not sure if that
would move the needle on whether ISB is a good idea or not.

            regards, tom lane



On Thu, May 01, 2025 at 02:48:59PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
>> <dipietro.salvatore@gmail.com> wrote:
>>> we would like to propose the removal of the Instruction
>>> Synchronization Barrier (isb) for aarch64 architectures. Based on our
>>> testing on Graviton instances (m7g.16xlarge), we can see on average
>>> over multiple iterations up to 12% better performance using PGBench
>>> select-only and up to 9% with Sysbench oltp_read_only workloads. On
>>> Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
>>> select-only and up to 6% with Sysbench oltp_read_only workloads.
>>> We have also tested it putting more pressure on the spin_delay
>>> function, enabling pg_stat_statements.track_planning with PGBench
>>> read-only [0] and, on average, the patch shows up to 27% better
>>> performance on m6g.16xlarge and up to 37% on m7g.16xlarge.
> 
>> I think you should make some kind of argument about why the previous
>> conclusion was wrong, or why something's changed between then and now.
> 
> TBH, those numbers are large enough that I flat out don't believe
> them.  As noted in the previous thread, we've managed to squeeze out
> a lot of our dependencies on spinlock performance, via algorithmic
> changes, migration to atomic ops, etc.  So I think ten-percent-ish
> improvement on a plain pgbench test just isn't very plausible.
> We certainly didn't see that kind of effect when we were doing that
> earlier round of testing --- we had to use a custom testing lashup
> to get numbers that were outside the noise at all.

I'm not sure there's actually any hot spinlock involved in a regular
select-only pgbench (unless perhaps pg_stat_statements was enabled or
something).  But I do know pg_stat_statements.track_planning stresses the
spinlock code quite a bit, and commit 3d0b4b1 recently added a non-locking
initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
appropriate.  It'd be interesting to see the performance difference of
removing the ISB with and without commit 3d0b4b1 applied.

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> ... commit 3d0b4b1 recently added a non-locking
> initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
> appropriate.  It'd be interesting to see the performance difference of
> removing the ISB with and without commit 3d0b4b1 applied.

Oh!  That's an excellent point.  The OP didn't mention if their tests
were done before or after 3d0b4b1, but that might well matter.

I still think pgbench is a very blunt tool for this type of testing,
though.  I recommend resurrecting the test_shm_mq-based hack discussed
in the prior thread and seeing what that shows.

            regards, tom lane



On Thu, May 01, 2025 at 04:08:06PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> ... commit 3d0b4b1 recently added a non-locking
>> initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
>> appropriate.  It'd be interesting to see the performance difference of
>> removing the ISB with and without commit 3d0b4b1 applied.
> 
> Oh!  That's an excellent point.  The OP didn't mention if their tests
> were done before or after 3d0b4b1, but that might well matter.
> 
> I still think pgbench is a very blunt tool for this type of testing,
> though.  I recommend resurrecting the test_shm_mq-based hack discussed
> in the prior thread and seeing what that shows.

Well, I have interesting results.  This is all on a c8g.24xlarge (96 cores,
Neoverse-V2, Armv9.0-a).

For the first test_shm_mq test, I ran the following:

    SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
    SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 2);
    SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
    ...

This gave me the following results (values are in seconds):

            w/o 3d0b4b1     w/ 3d0b4b1
            ISB     no ISB  ISB     no ISB
    1       1.4     1.6     1.5     1.6
    2       2.1     2.0     2.1     2.1
    4       3.2     3.5     3.3     3.5
    8       7.4     8.1     7.2     8.4
    16      18.0    35.9    22.7    23.4
    32      35.7    85.6    53.7    49.5
    64      85.1    ?       147.6   100.1

For the second test_shm_mq test, I ran at higher concurrency, so I had to
reduce the loop counts:

    SELECT test_shm_mq_pipelined(16384, 'xyzzy', 100000, 32);
    ...

That gave me the following:

            w/o 3d0b4b1     w/ 3d0b4b1
            ISB     no ISB  ISB     no ISB
    32      0.4     0.8     0.5     0.6
    64      2.0     4.8     1.3     1.1
    128     6.1     29.3    7.5     2.1
    256     43.0    66.4    24.4    4.5

Finally, I ran the pgbench select-only test with
pg_stat_statements.track_planning enabled (values are in thousands of
transactions per second):

    w/o 3d0b4b1     w/ 3d0b4b1
    ISB     no ISB  ISB     no ISB
    71.4    67.4    538.2   891.2

So...

* The ISB does seem to have a positive effect without commit 3d0b4b1
  applied.

* With commit 3d0b4b1 applied, removing the ISB seems to have a positive
  effect at high concurrencies.  This is especially pronounced in the
  pgbench test.

* With commit 3d0b4b1 applied, removing the ISB doesn't change much at
  lower concurrencies, and there might even be a small regression.

* At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress
  some test_shm_mq tests.  Removing the ISB instruction appears to help in
  some cases, but not all.

-- 
nathan



On Thu, 1 May 2025 at 13:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Oh!  That's an excellent point.  The OP didn't mention if their tests
> were done before or after 3d0b4b1, but that might well matter.

The benchmarks we conducted are based on REL_17_2 branch which do not
include the TAS_SPIN(lock) change for ARM yet.



On Thu, 1 May 2025 at 14:50, Nathan Bossart <nathandbossart@gmail.com> wrote:
> So...
>
> * The ISB does seem to have a positive effect without commit 3d0b4b1
>   applied.
>
> * With commit 3d0b4b1 applied, removing the ISB seems to have a positive
>   effect at high concurrencies.  This is especially pronounced in the
>   pgbench test.
>
> * With commit 3d0b4b1 applied, removing the ISB doesn't change much at
>   lower concurrencies, and there might even be a small regression.
>
> * At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress
>   some test_shm_mq tests.  Removing the ISB instruction appears to help in
>   some cases, but not all.

Based on your findings Nathan, what is the best way to proceed for this change?
Do we need more validation for it? If yes, which kind?

Thanks,
Salvatore



On Tue, May 13, 2025 at 11:37:45AM -0700, Salvatore Dipietro wrote:
> On Thu, 1 May 2025 at 14:50, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> So...
>>
>> * The ISB does seem to have a positive effect without commit 3d0b4b1
>>   applied.
>>
>> * With commit 3d0b4b1 applied, removing the ISB seems to have a positive
>>   effect at high concurrencies.  This is especially pronounced in the
>>   pgbench test.
>>
>> * With commit 3d0b4b1 applied, removing the ISB doesn't change much at
>>   lower concurrencies, and there might even be a small regression.
>>
>> * At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress
>>   some test_shm_mq tests.  Removing the ISB instruction appears to help in
>>   some cases, but not all.
> 
> Based on your findings Nathan, what is the best way to proceed for this change?
> Do we need more validation for it? If yes, which kind?

Well, I am confused because your recent message [0] indicated that you saw
improvement without the non-locking initial test in TAS_SPIN, which seems
to contradict my findings [1].  Could you retry your tests on v18devel?  It
might also be useful to repeat the tests on a variety of hardware to ensure
it's a win across the board.

[0] https://postgr.es/m/CAGnuAhXNQXCcS1nCeD6E0Dyfi4Ms-b0sjcm79Y9iMi5WxUqM4g%40mail.gmail.com
[1] https://postgr.es/m/aBPsrFbjnrqp3_8S%40nathan

-- 
nathan



On Mon, May 19, 2025 at 11:38:39AM -0500, Nathan Bossart wrote:
> On Tue, May 13, 2025 at 11:37:45AM -0700, Salvatore Dipietro wrote:
>> Based on your findings Nathan, what is the best way to proceed for this change?
>> Do we need more validation for it? If yes, which kind?
> 
> Well, I am confused because your recent message [0] indicated that you saw
> improvement without the non-locking initial test in TAS_SPIN, which seems
> to contradict my findings [1].  Could you retry your tests on v18devel?  It
> might also be useful to repeat the tests on a variety of hardware to ensure
> it's a win across the board.

I should probably also mention that we are in feature-freeze until v19
development opens in July.

-- 
nathan



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Salvatore Dipietro
Дата:
On Mon, 19 May 2025 at 09:38, Nathan Bossart <nathandbossart@gmail.com> wrote:
 > Could you retry your tests on v18devel?  It might also be useful to
repeat the tests on a variety of hardware to ensure
> it's a win across the board.


Hi Nathan,
Thanks for your clarification. As you requested, I have performed more
tests on different instance types and sizes.
In particular, I have run the `test_shm_mq_pipelined` benchmark using
Ubuntu 22.04 on m7g.[2,8,16]xlarge and
c8g.[2,8.24]xlarge instances with PG master branch (commit:
84914e964b4). Each test has been repeated 30 times
and here is the average (in seconds) and the difference from baseline (Master).

Graviton3 instances (m7g) results:
| Concurrency | Loops    | 2xl Master | 2xl No-ISB   | 8xl Master |
8xl No-ISB   | 16xl Master | 16xl No-ISB    |
|-------------|----------|------------|--------------|------------|--------------|-------------|----------------|
| 1           | 10000000 | 1.9s       | 1.9s (1.00x) | 1.9s       |
1.8s (1.06x) | 1.7s        | 1.8s (0.94x)   |
| 2           | 10000000 | 2.4s       | 2.4s (1.00x) | 2.5s       |
2.3s (1.09x) | 2.3s        | 2.3s (1.00x)   |
| 4           | 10000000 | 3.8s       | 3.8s (1.00x) | 3.8s       |
3.6s (1.06x) | 3.5s        | 3.8s (0.92x)   |
| 8           | 10000000 | 8.9s       | 10.3s (0.86x)| 7.5s       |
8.6s (0.87x) | 7.8s        | 8.9s (0.88x)   |
| 16          | 10000000 | 21.6s      | 22.5s (0.96x)| 22.5s      |
23.6s (0.95x)| 21.4s       | 24.9s (0.86x)  |
| 32          | 10000000 | 42.8s      | 41.3s (1.04x)| 114.7s     |
52.0s (2.21x)| 88.6s       | 49.9s (1.78x)  |
| 64          | 10000000 | 81.8s      | 73.3s (1.12x)| 395.9s     |
85.2s (4.65x)| 381.3s      | 97.0s (3.93x)  |
| 32          | 100000   | 0.4s       | 0.4s (1.00x) | 1.1s       |
0.5s (2.20x) | 1.1s        | 0.6s (1.83x)   |
| 64          | 100000   | 0.8s       | 0.8s (1.00x) | 3.9s       |
0.9s (4.33x) | 3.9s        | 1.1s (3.55x)   |
| 128         | 100000   | 1.6s       | 1.5s (1.07x) | 8.5s       |
1.9s (4.47x) | 13.3s       | 2.0s (6.65x)   |
| 256         | 100000   | 3.2s       | 3.1s (1.03x) | 19.8s      |
4.0s (4.95x) | 35.9s       | 4.1s (8.76x)   |

Graviton4 instances (c8g) results:
| Concurrency | Loops    | 2xl Master | 2xl No-ISB    | 8xl Master |
8xl No-ISB    | 24xl Master | 24xl No-ISB    |
|-------------|----------|------------|---------------|------------|---------------|-------------|----------------|
| 1           | 10000000 | 1.7s       | 1.6s (1.06x)  | 1.6s       |
1.6s (1.00x)  | 1.6s        | 1.5s (1.07x)   |
| 2           | 10000000 | 2.2s       | 2.2s (1.00x)  | 2.2s       |
2.2s (1.00x)  | 2.2s        | 2.1s (1.05x)   |
| 4           | 10000000 | 3.4s       | 3.5s (0.97x)  | 3.5s       |
3.4s (1.03x)  | 3.5s        | 3.4s (1.03x)   |
| 8           | 10000000 | 10.9s      | 13.9s (0.78x) | 8.2s       |
9.4s (0.87x)  | 7.8s        | 8.2s (0.95x)   |
| 16          | 10000000 | 23.6s      | 27.0s (0.87x) | 26.3s      |
26.1s (1.01x) | 27.1s       | 28.1s (0.96x)  |
| 32          | 10000000 | 44.6s      | 46.9s (0.95x) | 60.6s      |
47.7s (1.27x) | 62.1s       | 50.4s (1.23x)  |
| 64          | 10000000 | 81.4s      | 81.5s (1.00x) | 189.4s     |
91.5s (2.07x) | 176.9s      | 101.3s (1.75x) |
| 32          | 100000   | 0.5s       | 0.5s (1.00x)  | 0.6s       |
0.5s (1.20x)  | 0.6s        | 0.5s (1.20x)   |
| 64          | 100000   | 0.8s       | 0.8s (1.00x)  | 1.7s       |
0.9s (1.89x)  | 2.1s        | 1.2s (1.75x)   |
| 128         | 100000   | 1.5s       | 1.6s (0.94x)  | 4.5s       |
1.9s (2.37x)  | 7.8s        | 2.1s (3.71x)   |
| 256         | 100000   | 3.3s       | 3.1s (1.06x)  | 9.7s       |
4.1s (2.37x)  | 22.0s       | 4.5s (4.89x)   |


We can notice that with low concurrency (1,2,4) results are similar
while with medium concurrency (8,16)
the No-ISB approach can introduce some regression especially on
smaller instances. However, we can see some significant
positive performance impact with high concurrency (>=32) settings on
large instances (up to 8.76x on m7g.16xl with 256 concurrency).



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Nathan Bossart
Дата:
On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote:
> We can notice that with low concurrency (1,2,4) results are similar
> while with medium concurrency (8,16)
> the No-ISB approach can introduce some regression especially on
> smaller instances. However, we can see some significant
> positive performance impact with high concurrency (>=32) settings on
> large instances (up to 8.76x on m7g.16xl with 256 concurrency).

Given these mixed results, it's unclear to me how exactly we should
proceed.  Perhaps there is another approach that reduces the regressions to
a negligible level while still producing gains at higher levels of
concurrency.  Or maybe we can convince ourselves that these regressions
aren't worth worrying about, but that seems like a bit of a stretch to me.

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote:
>> We can notice that with low concurrency (1,2,4) results are similar
>> while with medium concurrency (8,16)
>> the No-ISB approach can introduce some regression especially on
>> smaller instances. However, we can see some significant
>> positive performance impact with high concurrency (>=32) settings on
>> large instances (up to 8.76x on m7g.16xl with 256 concurrency).

> Given these mixed results, it's unclear to me how exactly we should
> proceed.  Perhaps there is another approach that reduces the regressions to
> a negligible level while still producing gains at higher levels of
> concurrency.  Or maybe we can convince ourselves that these regressions
> aren't worth worrying about, but that seems like a bit of a stretch to me.

I agree; I'm also worried that this may be optimizing for one
particular ARM implementation and have negative effects on others.

Perhaps a better way forward is to identify which spinlock is
suffering such huge contention and try to alleviate that at a
higher design level.  (That could benefit more than just ARM, too.)

            regards, tom lane



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Álvaro Herrera
Дата:
On 2025-May-01, Tom Lane wrote:

> One other thing that comes to mind is that pg_stat_statements
> has stretched the intention of "short straight-line code segment"
> to the point of unrecognizability.  Maybe it needs to stop using
> spinlocks to protect pgssEntry updates.  But I'm not sure if that
> would move the needle on whether ISB is a good idea or not.

Yeah, it looks like pgss_store() is being too generous on the amount of
code run with that spinlock held.  However, changing that spinlock to an
lwlock doesn't look easy, because of the way each pgss entry is created
as a dynahash entry, and then deallocated from there.  With spinlocks we
can just reinit the spinlock each time, but that doesn't work with
lwlocks.  We have no easy way to associate then disassociate each entry
from a specific lwlock.

Maybe we'd need to do
RequestNamedLWLockTranche("pg_stat_statements", 1 + pgss_max)
at startup, then put them all in an lwlock-freelist: a list of unused
lwlocks in this tranche, to take one on entry_alloc and put it back on
entry_dealloc.

I think Salvatore's comment that enabling track_planning makes the
performance gain more pronounced is evidence that changing this spinlock
to lwlock would be beneficial.

Another benefit of using an lwlock: pg_stat_statements_internal() could
use shared mode to read each entry.  Probably not as critical (because
it requires the view to be read), but it'd reduce the performance hit
when the user is reading the view.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Nathan Bossart
Дата:
On Thu, Aug 14, 2025 at 11:29:08AM +0200, Álvaro Herrera wrote:
> On 2025-May-01, Tom Lane wrote:
>> One other thing that comes to mind is that pg_stat_statements
>> has stretched the intention of "short straight-line code segment"
>> to the point of unrecognizability.  Maybe it needs to stop using
>> spinlocks to protect pgssEntry updates.  But I'm not sure if that
>> would move the needle on whether ISB is a good idea or not.
> 
> Yeah, it looks like pgss_store() is being too generous on the amount of
> code run with that spinlock held.  However, changing that spinlock to an
> lwlock doesn't look easy, because of the way each pgss entry is created
> as a dynahash entry, and then deallocated from there.  With spinlocks we
> can just reinit the spinlock each time, but that doesn't work with
> lwlocks.  We have no easy way to associate then disassociate each entry
> from a specific lwlock.

I've attempted multiple times now to replace this spinlock with LWLocks or
atomics, but thus far haven't produced anything that improved matters [0].
IIRC I just dynamically created LWLocks with LWLockInitialize() as needed,
which is tedious but easy enough.  My recollection is that the use of
floats for many of the values greatly complicated the atomics attempts.
So, fixing it might require some larger changes.

[0] https://postgr.es/m/Zs4hJ6-Fg8DMgU_P%40nathan

-- 
nathan



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Andres Freund
Дата:
Hi,

On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote:
> On 2025-May-01, Tom Lane wrote:
> 
> > One other thing that comes to mind is that pg_stat_statements
> > has stretched the intention of "short straight-line code segment"
> > to the point of unrecognizability.  Maybe it needs to stop using
> > spinlocks to protect pgssEntry updates.  But I'm not sure if that
> > would move the needle on whether ISB is a good idea or not.
> 
> Yeah, it looks like pgss_store() is being too generous on the amount of
> code run with that spinlock held.

Indeed. To the point that these days pgss is basically unusable for workloads
with many short queries.


> However, changing that spinlock to an lwlock doesn't look easy, because of
> the way each pgss entry is created as a dynahash entry, and then deallocated
> from there.  With spinlocks we can just reinit the spinlock each time, but
> that doesn't work with lwlocks.  We have no easy way to associate then
> disassociate each entry from a specific lwlock.

I'm not following? The lwlock can just be inside the struct, just like the
spinlock is? "Association" is just LWLockInitialize() and deassociation is not
needed.

Greetings,

Andres Freund



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Nathan Bossart
Дата:
On Fri, Aug 15, 2025 at 01:39:52PM -0400, Andres Freund wrote:
> On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote:
>> However, changing that spinlock to an lwlock doesn't look easy, because of
>> the way each pgss entry is created as a dynahash entry, and then deallocated
>> from there.  With spinlocks we can just reinit the spinlock each time, but
>> that doesn't work with lwlocks.  We have no easy way to associate then
>> disassociate each entry from a specific lwlock.
> 
> I'm not following? The lwlock can just be inside the struct, just like the
> spinlock is? "Association" is just LWLockInitialize() and deassociation is not
> needed.

Indeed.  I rebased an old patch that I had lying around to demonstrate.  If
my past testing [0] is to be trusted, this actually hurts performance,
unfortunately.

[0] https://postgr.es/m/Zs4hJ6-Fg8DMgU_P@nathan

-- 
nathan

Вложения

Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Andres Freund
Дата:
Hi,

On 2025-08-15 12:57:52 -0500, Nathan Bossart wrote:
> On Fri, Aug 15, 2025 at 01:39:52PM -0400, Andres Freund wrote:
> > On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote:
> >> However, changing that spinlock to an lwlock doesn't look easy, because of
> >> the way each pgss entry is created as a dynahash entry, and then deallocated
> >> from there.  With spinlocks we can just reinit the spinlock each time, but
> >> that doesn't work with lwlocks.  We have no easy way to associate then
> >> disassociate each entry from a specific lwlock.
> > 
> > I'm not following? The lwlock can just be inside the struct, just like the
> > spinlock is? "Association" is just LWLockInitialize() and deassociation is not
> > needed.
> 
> Indeed.  I rebased an old patch that I had lying around to demonstrate.  If
> my past testing [0] is to be trusted, this actually hurts performance,
> unfortunately.

FWIW, rather interesting result of testing the patch briefly:

On my older workstation, the patch is a substantial *gain* when there's a lot
of contention. But on my newer workstation it's a *loss*.

The penalty from enabling pg_stat_statements for readonly pgbench on the newer
workstation is rather bad - about 1/3 the throughput.


I think the main reason that lwlocks loose on the newer machine is that we
loose spinning. The newer machine has more cores and more numa domains and the
fairer locks lead to more cacheline pingpong...


IMO, the only way to actually make pg_stat_statements scale is to move to a
model much more like our regular stats. I.e. accumulate counters in backend
local memory and only occasionally update the shared stats. Even if you were
to move pgss successfully to atomics, the cacheline contention still would be
terrible for performance.

FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns*
compared to using the spinlocks. You'd replace one atomic operation with
dozens, to update all those fields individually. With loads of cacheline
pingpong inbetween.

Greetings,

Andres Freund



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Nathan Bossart
Дата:
On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote:
> IMO, the only way to actually make pg_stat_statements scale is to move to a
> model much more like our regular stats. I.e. accumulate counters in backend
> local memory and only occasionally update the shared stats.

Agreed.  I remember discussing something similar at pgconf.dev this year.

> FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns*
> compared to using the spinlocks. You'd replace one atomic operation with
> dozens, to update all those fields individually. With loads of cacheline
> pingpong inbetween.

Right.  At some point I tried moving most things to atomics and leaving the
rest behind the spinlock, and IIRC there wasn't really any improvement.  I
didn't dig into whether that was because of atomic-related cache line
ping-pong or the existing spinlock, but regardless, I quickly discarded
that idea.

-- 
nathan



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Nathan Bossart
Дата:
On Fri, Aug 15, 2025 at 03:25:20PM -0500, Nathan Bossart wrote:
> On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote:
>> IMO, the only way to actually make pg_stat_statements scale is to move to a
>> model much more like our regular stats. I.e. accumulate counters in backend
>> local memory and only occasionally update the shared stats.
> 
> Agreed.  I remember discussing something similar at pgconf.dev this year.

BTW I'm planning to give this a try...

-- 
nathan



Andres Freund <andres@anarazel.de> writes:
> FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns*
> compared to using the spinlocks. You'd replace one atomic operation with
> dozens, to update all those fields individually. With loads of cacheline
> pingpong inbetween.

Not only that, but you'd no longer have any semblance of read-consistency
between the fields of a stats entry.

            regards, tom lane



Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Michael Paquier
Дата:
On Fri, Aug 15, 2025 at 03:54:29PM -0500, Nathan Bossart wrote:
> On Fri, Aug 15, 2025 at 03:25:20PM -0500, Nathan Bossart wrote:
>> On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote:
>>> IMO, the only way to actually make pg_stat_statements scale is to move to a
>>> model much more like our regular stats. I.e. accumulate counters in backend
>>> local memory and only occasionally update the shared stats.
>>
>> Agreed.  I remember discussing something similar at pgconf.dev this year.
>
> BTW I'm planning to give this a try...

This means to use the pluggable pgstats API to achieve that.  FWIW,
one difficulty I am foreseeing is how we want to cap and control the
number of PGSS entries that are saved in the pgstats hash table, based
on the current max PGSS can be set to.  Just saying that we'll most
likely rely on a separate zone of shared memory to track this
information.  Most of that should happen in the flush callback, based
on the current infra in place, I guess.

More stats are good to have, still capping them is key to keep the
backend under control, and it does not make sense to me to use a fixed
chunk of shared memory for this side (meaning a fixed-sized pgstats
kind for PGSS).  Another benefit is to make the max reloadable.
--
Michael

Вложения

Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

От
Nathan Bossart
Дата:
On Mon, Aug 11, 2025 at 04:12:48PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote:
>>> We can notice that with low concurrency (1,2,4) results are similar
>>> while with medium concurrency (8,16)
>>> the No-ISB approach can introduce some regression especially on
>>> smaller instances. However, we can see some significant
>>> positive performance impact with high concurrency (>=32) settings on
>>> large instances (up to 8.76x on m7g.16xl with 256 concurrency).
> 
>> Given these mixed results, it's unclear to me how exactly we should
>> proceed.  Perhaps there is another approach that reduces the regressions to
>> a negligible level while still producing gains at higher levels of
>> concurrency.  Or maybe we can convince ourselves that these regressions
>> aren't worth worrying about, but that seems like a bit of a stretch to me.
> 
> I agree; I'm also worried that this may be optimizing for one
> particular ARM implementation and have negative effects on others.

Based on this discussion, I am marking the patch as Returned with Feedback.

-- 
nathan