Discussion: Re: use a non-locking initial test in TAS_SPIN on AArch64
Hi~
> Upon closer inspection, I noticed that we don't implement a custom
> TAS_SPIN() for this architecture, so I quickly hacked together the attached
> patch and ran a couple of benchmarks that stressed the spinlock code. I
> found no discussion about TAS_SPIN() on ARM in the archives, but I did
> notice that the initial AArch64 support was added [0] before x86_64 started
> using a non-locking test [1].
It reminds me of a discussion about improving spinlock performance on ARM
in 2020 [0], though the discussion is about CAS and TAS, not TAS_SPIN() itself.
> tps = 74135.100891 (without initial connection time)
> tps = 549462.785554 (without initial connection time)
The result looks great, but the discussion in [0] shows that the result may
vary among different ARM chips. Could you provide the chip model of this
test? So that we can do a cross validation of this patch. Not sure if compiler
version is necessary too. I'm willing to test it on Alibaba Cloud Yitian 710
if I have time.
Regards,
Jingtang
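For readers following the thread, the idea of a "non-locking initial test" can be sketched roughly as below. This is not the attached patch: the names `tas`, `tas_spin`, `spin_acquire`, and `spin_release` and the `slock_t` typedef are hypothetical stand-ins, the atomic swap uses the GCC/Clang `__sync` builtins as an assumption, and the delay/backoff logic of a real spinlock is omitted. The point is only that a waiter reads the lock word with a plain load first and attempts the atomic swap only when the lock looks free.

```c
#include <stdint.h>

/* Hypothetical stand-in for the real lock type; 0 = free, nonzero = held. */
typedef volatile int32_t slock_t;

/*
 * Plain test-and-set: atomically store 1 and return the previous value.
 * A return of 0 means the caller acquired the lock.
 * Assumes the GCC/Clang __sync builtins (acquire semantics).
 */
static inline int
tas(slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

/*
 * Test-and-test-and-set: do an ordinary load first and report "busy"
 * without touching the cache line exclusively; only fall through to the
 * atomic swap when the lock appears to be free.
 */
static inline int
tas_spin(slock_t *lock)
{
    return *lock ? 1 : tas(lock);
}

/* Try the atomic swap once, then spin using the cheaper read-first test. */
static inline void
spin_acquire(slock_t *lock)
{
    if (tas(lock))
    {
        while (tas_spin(lock))
            ;                   /* a real implementation would back off here */
    }
}

/* Release the lock by storing 0 with release semantics. */
static inline void
spin_release(slock_t *lock)
{
    __sync_lock_release(lock);
}
```

Under contention, waiters in this sketch spin on a shared read of the lock word instead of repeatedly issuing atomic swaps, which is the behaviour the benchmarks in this thread are probing.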
On Wed, Oct 23, 2024 at 11:01:05AM +0800, Jingtang Zhang wrote:
> The result looks great, but the discussion in [0] shows that the result may
> vary among different ARM chips. Could you provide the chip model of this
> test? So that we can do a cross validation of this patch.

This is on a c8g.24xlarge, which is using Neoverse-V2 and Armv9.0-a [0].

> I'm willing to test it on Alibaba Cloud Yitian 710 if I have time.

That would be great. I have a couple of Apple M-series machines I can test,
too.

[0] https://github.com/aws/aws-graviton-getting-started/blob/main/README.md#building-for-graviton

--
nathan
On Wed, Oct 23, 2024 at 09:46:56AM -0500, Nathan Bossart wrote:
> I have a couple of Apple M-series machines I can test, too.

After some preliminary tests on an M3, I'm not seeing any gains outside the
noise range. That's not too surprising because it's likely more difficult
to create a lot of spinlock contention on these smaller machines. But, at
the very least, I'm not seeing a regression.

--
nathan
Hi, Nathan.

I just realized that I almost forgot about this thread :)

> The result looks great, but the discussion in [0] shows that the result may
> vary among different ARM chips. Could you provide the chip model of this
> test? So that we can do a cross validation of this patch. Not sure if compiler
> version is necessary too. I'm willing to test it on Alibaba Cloud Yitian 710
> if I have time.

I ran some benchmarks on Yitian 710.

On c8y.16xlarge (64 cores):

Without the patch:

  80.31%  postgres               [.] __aarch64_swp4_acq
   1.77%  postgres               [.] __aarch64_ldadd4_acq_rel
   1.13%  postgres               [.] hash_search_with_hash_value
   0.87%  pg_stat_statements.so  [.] __aarch64_swp4_acq
   0.72%  postgres               [.] perform_spin_delay
   0.44%  postgres               [.] _bt_compare

tps = 295272.628421 (including connections establishing)
tps = 295335.660323 (excluding connections establishing)

Patched:

   9.94%  postgres               [.] s_lock
   6.07%  postgres               [.] __aarch64_swp4_acq
   5.73%  postgres               [.] hash_search_with_hash_value
   2.81%  postgres               [.] perform_spin_delay
   2.29%  postgres               [.] _bt_compare
   2.15%  postgres               [.] PinBuffer

tps = 864519.764125 (including connections establishing)
tps = 864638.244443 (excluding connections establishing)

Seems that great performance could be gained if s_lock contention is severe.
This may be more likely to happen on bigger machines.

On c8y.2xlarge (8 cores), I failed to make s_lock contended severely, and
as a result this patch didn't bring any difference outside the noise.

Regards,
Jingtang
On Wed, Jan 15, 2025 at 07:50:38PM +0800, Jingtang Zhang wrote:
> Seems that great performance could be gained if s_lock contention is severe.
> This may be more likely to happen on bigger machines.
>
> On c8y.2xlarge (8 cores), I failed to make s_lock contended severely, and
> as a result this patch didn't bring any difference outside the noise.

Thanks for sharing.

--
nathan