Discussion: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

Hi all,

Dave and I have been working together to get ARM64 with MSVC functional.
 The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

PostgreSQL, when compiled with MSVC on ARM64 with optimizations enabled
(e.g., /O2), fails 027_stream_regress.  After some investigation and
analysis of the generated assembly code, Dave Cramer and I identified
the root cause as insufficient memory barrier semantics in both atomic
operations and spinlocks on ARM64 when compiled with MSVC at /O2.

Dave knew I was in the process of setting up a Win11/ARM64/MSVC build
animal and pinged me with this issue.  Dave got me started on the path
to finding the issue by sending me his workaround:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ static void
WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * before the data page can be written out.  This implements the basic
  * WAL rule "write the log before the data".)
  */
+#pragma optimize("",off)
 XLogRecPtr
 XLogInsertRecord(XLogRecData *rdata,
                                 XLogRecPtr fpw_lsn,
@@ -1088,7 +1089,7 @@ XLogInsertRecord(XLogRecData *rdata,

        return EndPos;
 }
-
+#pragma optimize("",on)
 /*


This pointed a finger at the atomics, so I started there.  We used a few
tools, but worth noting is https://godbolt.org/ where we were able to
quickly see that the MSVC assembly was missing the "dmb" barriers on
this platform.  I'm not sure how long this link will be valid, but in
the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn


PROBLEM DESCRIPTION

PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
manifesting as timing-dependent failures in critical sections
protected by spinlocks and atomic variables. The failures are
reproducible when the test suite is compiled with optimization flags
(/O2), particularly in the recovery/027_stream_regress test which
involves WAL replication and standby recovery.

The root cause has two components:

1. Atomic operations lack memory barriers on ARM64
2. MSVC spinlock implementation lacks memory barriers on ARM64


TECHNICAL ANALYSIS

PART 1: ATOMIC OPERATIONS MEMORY BARRIERS

GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
generates a call to __aarch64_cas4_acq_rel(), which is a library
function that provides explicit acquire-release memory ordering
semantics through either:

* LSE path (modern ARM64): Using CASAL instruction with built-in
  memory ordering [1][2]

* Legacy path (older ARM64): Using LDAXR/STLXR instructions with
  explicit dmb sy instruction [3]

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].
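
For anyone wanting to reproduce the comparison, a minimal test along
these lines (a sketch; cas_seq_cst is a name invented here for
illustration) can be compiled for ARM64 by both toolchains on godbolt
and the emitted instructions around the CAS compared:

#ifdef _MSC_VER
#include <intrin.h>
#endif

/* Returns 1 if *ptr held 'expected' and has been set to 'desired'. */
int
cas_seq_cst(volatile long *ptr, long expected, long desired)
{
#ifdef _MSC_VER
  /* MSVC intrinsic: returns the value previously stored at *ptr */
  return _InterlockedCompareExchange(ptr, desired, expected) == expected;
#else
  /* GCC/Clang builtin with sequentially consistent ordering */
  return __atomic_compare_exchange_n(ptr, &expected, desired, 0,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
}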


PART 2: SPINLOCK IMPLEMENTATION LACKS BARRIERS

The MSVC spinlock implementation in src/include/storage/s_lock.h had
two issues on ARM64/MSVC:

#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
#define S_UNLOCK(lock) do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)

Issue 1: TAS() uses InterlockedCompareExchange without hardware barriers

The InterlockedCompareExchange intrinsic lacks full memory barrier
semantics on ARM64, identical to the atomic operations issue.

Issue 2: S_UNLOCK() uses only a compiler barrier

_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
barrier [6].  It prevents the compiler from reordering operations, but
the CPU can still reorder memory operations. This is fundamentally
insufficient for ARM64's weaker memory model.

For comparison, GCC's __sync_lock_release() emits actual hardware
barriers.
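
To make the distinction concrete, here is a small sketch (assuming an
MSVC ARM64 target; protected_data stands in for any shared state):

#include <intrin.h>

static volatile long protected_data;

/* Compiler-barrier-only release: the store to *lock may become visible
 * to other cores before the store to protected_data. */
static void
release_compiler_only(volatile long *lock)
{
  protected_data = 42;
  _ReadWriteBarrier();       /* constrains the compiler only */
  *lock = 0;
}

/* Hardware-barrier release: the DMB orders the data store before the
 * store that releases the lock. */
static void
release_with_dmb(volatile long *lock)
{
  protected_data = 42;
  __dmb(_ARM64_BARRIER_SY);  /* full system memory barrier */
  *lock = 0;
}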

IMPACT ON 027_STREAM_REGRESS

The 027_stream_regress test involves WAL replication and standby
recovery — heavily dependent on synchronized access to shared memory
protected by spinlocks [7].  Without proper barriers on ARM64:

1. Thread A acquires spinlock (no full barrier emitted)
2. Thread A modifies shared WAL buffer
3. Thread B acquires spinlock before Thread A's writes become visible
4. Thread B reads stale WAL data
5. WAL replication gets corrupted or hangs indefinitely
6. Test times out waiting for standby to catch up


WHY ARM32 AND X86/X64 ARE UNAFFECTED

MSVC's _InterlockedCompareExchange does provide full memory barriers on:
* x86/x64: Memory barriers are implicit in the x86 memory model [8]
* ARM32: MSVC explicitly generates full barriers for ARM32 [5]

Only ARM64 lacks the necessary barriers, making this a platform-specific
issue.

ATTACHED SOLUTION

Add explicit DMB (Data Memory Barrier) instructions before and after
atomic operations and spinlock operations on ARM64 to provide sequential
consistency semantics.

0002: src/include/port/atomics/generic-msvc.h

Added platform-specific DMB macros that expand to
__dmb(_ARM64_BARRIER_SY) on ARM64.

Applied to all six atomic operations:
* pg_atomic_compare_exchange_u32_impl()
* pg_atomic_exchange_u32_impl()
* pg_atomic_fetch_add_u32_impl()
* pg_atomic_compare_exchange_u64_impl()
* pg_atomic_exchange_u64_impl()
* pg_atomic_fetch_add_u64_impl()
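
The shape of the change is roughly the following (a sketch rather than
the exact patch text; the PG_DMB name is invented here for
illustration):

#ifdef _M_ARM64
#define PG_DMB() __dmb(_ARM64_BARRIER_SY)
#else
#define PG_DMB() ((void) 0)
#endif

static inline bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                    uint32 *expected, uint32 newval)
{
  bool    ret;
  uint32  current;

  PG_DMB();   /* full barrier before the atomic operation */
  current = InterlockedCompareExchange(&ptr->value, newval, *expected);
  PG_DMB();   /* full barrier after the atomic operation */
  ret = current == *expected;
  *expected = current;
  return ret;
}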


0001: src/include/storage/s_lock.h

Added ARM64-specific spinlock implementation with explicit DMB barriers [9]:

#if defined(_M_ARM64)
#define TAS(lock) tas_msvc_arm64(lock)

static __forceinline int
tas_msvc_arm64(volatile slock_t *lock)
{
  int result;

  /* Full barrier before atomic operation */
  __dmb(_ARM64_BARRIER_SY);

  /* Atomic compare-and-swap */
  result = InterlockedCompareExchange(lock, 1, 0);

  /* Full barrier after atomic operation */
  __dmb(_ARM64_BARRIER_SY);

  return result;
}

#define S_UNLOCK(lock) \
do { \
  __dmb(_ARM64_BARRIER_SY); /* Full barrier before release */ \
  (*(lock)) = 0; \
} while (0)

#else
  /* Non-ARM64 MSVC: existing implementation unchanged */
#endif


The spinlock acquire now ensures:

* Before CAS: All prior memory operations complete before
  acquiring the lock.

* After CAS: The CAS completes before subsequent operations
  access protected data

The spinlock release now ensures:

* Before writing 0: All critical section operations are visible
  to other threads


You may ask: why two DMBs in the atomic operations instead of one?
GCC's non-LSE path (LDAXR/STLXR) uses only one DMB because:
* LDAXR (Load-Acquire Exclusive) provides half-barrier acquire
  semantics [3]
* STLXR (Store-Release Exclusive) provides half-barrier release
  semantics [3]
* One final dmb sy upgrades to full sequential consistency

Since _InterlockedCompareExchange provides NO barrier semantics on
ARM64, we must provide both halves:

* First DMB acts as a release barrier (ensures prior memory ops
  complete before CAS)
* Second DMB acts as an acquire barrier (ensures subsequent memory
  ops wait for CAS)
* Together they provide sequential consistency matching GCC's
  semantics [3]
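
As an illustration of the two halves, consider the classic
message-passing pattern (a sketch, not code from the patch):

#include <intrin.h>

static volatile int payload;
static volatile long ready;

static void
writer(void)
{
  payload = 42;
  __dmb(_ARM64_BARRIER_SY);  /* release half: drain the payload store */
  ready = 1;
}

static int
reader(void)
{
  while (!ready)
    ;                        /* spin until the flag is observed */
  __dmb(_ARM64_BARRIER_SY);  /* acquire half: don't hoist the payload load */
  return payload;            /* now observes 42 */
}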


VERIFICATION

The fix has been verified by:

1. Spinlock fix resolves 027_stream_regress timeout: Test now passes
   consistently on MSVC ARM64 with /O2 optimization without hanging

2. Assembly code inspection: Confirmed that dmb sy instructions now
   appear in the optimized assembly for ARM64 builds

3. Platform compatibility: No regression on x86/x64 or ARM32 (macros
   expand to no-ops; original code path unchanged)


WHY CLANG/LLVM ON MACOS ARM64 DOESN'T HAVE THIS PROBLEM

PostgreSQL builds successfully on Apple Silicon Macs (ARM64) without
the memory ordering issues observed on MSVC Windows ARM64. The
difference comes down to how Clang/LLVM and MSVC handle atomic
operations.

CLANG/LLVM APPROACH (macOS, Linux, Android ARM64)

Clang/LLVM uses GCC-compatible atomic builtins
(__atomic_compare_exchange_n, etc.) even on platforms where it's not
GCC. The LLVM backend has an AtomicExpand pass that properly expands
these operations to include appropriate memory barriers for the
target architecture.

On ARM64, Clang generates:

* __aarch64_cas4_acq_rel library calls (or the CASAL instruction with LSE)
* Proper acquire-release semantics built into the instruction sequence
* Automatic full dmb sy barriers where needed

This means PostgreSQL's use of __sync_lock_test_and_set and __atomic*
builtins works correctly on macOS ARM64 without additional patches.


Phew... I hope I read all those docs correctly and got that right.  Feel
free to let me know if I missed something.  Looking forward to your
feedback and review so I can get this new build animal up and running.

best.

-greg

[1] ARM Developer: CAS Instructions
https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-

[2] ARM Developer: Load-Acquire and Store-Release Instructions
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

[3] ARM Developer: Data Memory Barrier (DMB)
https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en

[4] Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions
https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170

[5] Microsoft Learn: ARM Intrinsics - Memory Barriers
https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170

[6] Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier
https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170

[7] PostgreSQL: 027_stream_regress WAL replication testing
https://www.postgresql.org/message-id/193115.1763243897@sss.pgh.pa.us

[8] Intel Volume 3A: Memory Ordering
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

[9] Microsoft Developer Blog: The AArch64 processor - Barriers
https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968
Attachments
On 20.11.25 21:45, Greg Burd wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
>   The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.

 > -  zlib_t = dependency('zlib', required: zlibopt)
 > +  zlib_t = dependency('zlib', method : 'pkg-config', required: zlibopt)

This appears to be a change unrelated to your patch description.

Also, the second patch contains a number of random whitespace changes. 
It would be good to clean that up so the patch is easier to analyze.




On Nov 20 2025, at 5:00 pm, Peter Eisentraut <peter@eisentraut.org> wrote:

> On 20.11.25 21:45, Greg Burd wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>>   The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
> 

Hi Peter,

Thanks for taking a second to review.

> > -  zlib_t = dependency('zlib', required: zlibopt)
> > +  zlib_t = dependency('zlib', method : 'pkg-config', required: zlibopt)
> 
> This appears to be a change unrelated to your patch description.

Hmmm, these are changes that Dave added to get things to compile.  They
worked for me but I'll review them again in the morning and update the description.

> Also, the second patch contains a number of random whitespace changes. 
> It would be good to clean that up so the patch is easier to analyze.

Certainly, apologies.  I'll clean that up first thing in the morning.

-greg



I took a quick look at 0001.

+#ifdef _MSC_VER
+#include <intrin.h>
+#else
 #include <arm_acle.h>
 unsigned int crc;

I think you can remove this since we unconditionally do the runtime check
for MSVC.  In any case, the missing #endif seems likely to cause
problems.

--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
  */
 #include "c.h"
 
+#ifndef _MSC_VER
 #include <arm_acle.h>
+#endif

Hm.  Doesn't MSVC require intrin.h?

-- 
nathan



Hi,

On 2025-11-20 15:45:22 -0500, Greg Burd wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
>  The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.

Thanks for working on this!


> This pointed a finger at the atomics, so I started there.  We used a few
> tools, but worth noting is https://godbolt.org/ where we were able to
> quickly see that the MSVC assembly was missing the "dmb" barriers on
> this platform.  I'm not sure how long this link will be valid, but in
> the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn
> 
> 
> PROBLEM DESCRIPTION
> 
> PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
> manifesting as timing-dependent failures in critical sections
> protected by spinlocks and atomic variables. The failures are
> reproducible when the test suite is compiled with optimization flags
> (/O2), particularly in the recovery/027_stream_regress test which
> involves WAL replication and standby recovery.
> 
> The root cause has two components:
> 
> 1. Atomic operations lack memory barriers on ARM64
> 2. MSVC spinlock implementation lacks memory barriers on ARM64
> 
> TECHNICAL ANALYSIS
> 
> PART 1: ATOMIC OPERATIONS MEMORY BARRIERS
> 
> GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
> generates a call to __aarch64_cas4_acq_rel(), which is a library
> function that provides explicit acquire-release memory ordering
> semantics through either:
> 
> * LSE path (modern ARM64): Using CASAL instruction with built-in
>   memory ordering [1][2]
> 
> * Legacy path (older ARM64): Using LDAXR/STLXR instructions with
>   explicit dmb sy instruction [3]
> 
> MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
> atomic operation but does NOT emit the necessary Data Memory Barrier
> (DMB) instructions [4][5].

I couldn't reproduce this result when playing around on godbolt. By specifying
/arch:armv9.4 msvc can be convinced to emit the code for the intrinsics inline
(at least for most of them).  And that makes it visible that
_InterlockedCompareExchange() results in a "casal" instruction. Looking that
up shows:

https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
which includes these two statements:
"CASA and CASAL load from memory with acquire semantics."
"CASL and CASAL store to memory with release semantics."


> Issue 2: S_UNLOCK() uses only a compiler barrier
> 
> _ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
> barrier [6].  It prevents the compiler from reordering operations, but
> the CPU can still reorder memory operations. This is fundamentally
> insufficient for ARM64's weaker memory model.

Yea, that seems broken on a non-TSO architecture.  Is the problem fixed if you
change just this to include a proper barrier?

Greetings,

Andres Freund



Hi,

On 2025-11-20 19:03:47 -0500, Andres Freund wrote:
> > MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
> > atomic operation but does NOT emit the necessary Data Memory Barrier
> > (DMB) instructions [4][5].
> 
> I couldn't reproduce this result when playing around on godbolt. By specifying
> /arch:armv9.4 msvc can be convinced to emit the code for the intrinsics inline
> (at least for most of them).  And that makes it visible that
> _InterlockedCompareExchange() results in a "casal" instruction. Looking that
> up shows:
>
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
> which includes these two statements:
> "CASA and CASAL load from memory with acquire semantics."
> "CASL and CASAL store to memory with release semantics."

Further evidence for that is that
https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
states:
"This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order."

(note that we are using the function, not the intrinsic for TAS())

Greetings,

Andres



On Fri, Nov 21, 2025 at 9:45 AM Greg Burd <greg@burd.me> wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
>  The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.

A couple of immediate thoughts:

https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions?view=msvc-170

Doesn't seem to match your conclusion.

+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', false)
+    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)

USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK (pg_crc32c_armv8_choose.c) won't
actually work on Windows, but I don't think we should waste time
implementing it: vendor-supported versions of Windows 11 require
ARMv8.1A to boot[1][2], and that has it, so I think we should probably
just define USE_ARMV8_CRC32C.

+static __forceinline void
+spin_delay(void)
+{
+     /* Reference:
https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions
*/
+    __isb(_ARM64_BARRIER_SY);
+}

I don't doubt that barriers are missing in a few places, but how can
this be the right place?

If you have an environment set up so it's easy to test, I would also
be very interested to know if my patch set[3] that nukes all this
stuff and includes <stdatomic.h> instead, which is green on
Windows/x86 CI, will just work™ there too.

[1] https://en.wikipedia.org/wiki/Windows_11_version_history
[2] https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro
[3] https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com




On Thu, 20 Nov 2025 at 17:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
I took a quick look at 0001.

+#ifdef _MSC_VER
+#include <intrin.h>
+#else
 #include <arm_acle.h>
 unsigned int crc;

I think you can remove this since we unconditionally do the runtime check
for MSVC.  In any case, the missing #endif seems likely to cause
problems.

--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
  */
 #include "c.h"

+#ifndef _MSC_VER
 #include <arm_acle.h>
+#endif

Hm.  Doesn't MSVC require intrin.h?

--
nathan


Dave 
On Nov 20 2025, at 7:03 pm, Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2025-11-20 15:45:22 -0500, Greg Burd wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>>  The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
> 
> Thanks for working on this!

You're welcome, thanks for reviewing it. :)

>> 
>> MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
>> atomic operation but does NOT emit the necessary Data Memory Barrier
>> (DMB) instructions [4][5].
> 
> I couldn't reproduce this result when playing around on godbolt. By specifying
> /arch:armv9.4 msvc can be convinced to emit the code for the
> intrinsics inline
> (at least for most of them).  And that makes it visible that
> _InterlockedCompareExchange() results in a "casal" instruction.
> Looking that
> up shows:
>
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
> which includes these two statements:
> "CASA and CASAL load from memory with acquire semantics."
> "CASL and CASAL store to memory with release semantics."

I didn't even think to check for a compiler flag for the architecture,
nice call!  If this emits the correct instructions it is a much better
approach.  I'll give it a try, thanks for the nudge.

>> Issue 2: S_UNLOCK() uses only a compiler barrier
>> 
>> _ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
>> barrier [6].  It prevents the compiler from reordering operations, but
>> the CPU can still reorder memory operations. This is fundamentally
>> insufficient for ARM64's weaker memory model.
> 
> Yea, that seems broken on a non-TSO architecture.  Is the problem
> fixed if you change just this to include a proper barrier?

Using the flag from above, the _ReadWriteBarrier() does (in godbolt) turn
into a casal, which (AFAIK) is going to do the trick.  I'll see if I can
update meson.build and get this working as intended.

> Greetings,
> 
> Andres Freund

best.

-greg



On Nov 20 2025, at 7:07 pm, Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2025-11-20 19:03:47 -0500, Andres Freund wrote:
>> > MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
>> > atomic operation but does NOT emit the necessary Data Memory Barrier
>> > (DMB) instructions [4][5].
>> 
>> I couldn't reproduce this result when playing around on godbolt. By specifying
>> /arch:armv9.4 msvc can be convinced to emit the code for the
>> intrinsics inline
>> (at least for most of them).  And that makes it visible that
>> _InterlockedCompareExchange() results in a "casal" instruction.
>> Looking that
>> up shows:
>>
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
>> which includes these two statements:
>> "CASA and CASAL load from memory with acquire semantics."
>> "CASL and CASAL store to memory with release semantics."
> 
> Further evidence for that is that
> https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
> states:
> "This function generates a full memory barrier (or fence) to ensure
> that memory operations are completed in order."
> 
> (note that we are using the function, not the intrinsic for TAS())

Got it, thanks.

> Greetings,
> 
> Andres

best.

-greg



On Nov 20 2025, at 5:36 pm, Nathan Bossart <nathandbossart@gmail.com> wrote:

> I took a quick look at 0001.

Thanks for taking a second to review!

> +#ifdef _MSC_VER
> +#include <intrin.h>
> +#else
> #include <arm_acle.h>
> unsigned int crc;
> 
> I think you can remove this since we unconditionally do the runtime check
> for MSVC.  In any case, the missing #endif seems likely to cause
> problems.
> 
> --- a/src/port/pg_crc32c_armv8.c
> +++ b/src/port/pg_crc32c_armv8.c
> @@ -14,7 +14,9 @@
>  */
> #include "c.h"
> 
> +#ifndef _MSC_VER
> #include <arm_acle.h>
> +#endif
> 
> Hm.  Doesn't MSVC require intrin.h?

It does in fact fail to compile without this part of the patch, I think
Dave posted a bug about this.  I added the missing endif, thanks!

> -- 
> nathan

best.

-greg



On Nov 20 2025, at 7:08 pm, Thomas Munro <thomas.munro@gmail.com> wrote:

> On Fri, Nov 21, 2025 at 9:45 AM Greg Burd <greg@burd.me> wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>>  The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
>
> A couple of immediate thoughts:

Hey Thomas, I'm always interested in your thoughts.  Thanks for taking a
look at this.

> https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions?view=msvc-170
>
> Doesn't seem to match your conclusion.
>
> +  if cc.get_id() == 'msvc'
> +    cdata.set('USE_ARMV8_CRC32C', false)
> +    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>
> USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK (pg_crc32c_armv8_choose.c) won't
> actually work on Windows, but I don't think we should waste time
> implementing it: vendor-supported versions of Windows 11 require
> ARMv8.1A to boot[1][2], and that has it, so I think we should probably
> just define USE_ARMV8_CRC32C.

I'll give that a go, I think your logic is sound.

> +static __forceinline void
> +spin_delay(void)
> +{
> +     /* Reference:
> https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions
> */
> +    __isb(_ARM64_BARRIER_SY);
> +}
>
> I don't doubt that barriers are missing in a few places, but how can
> this be the right place?

Well, with the compiler flag that Andres mentioned I need to reconsider
this change.

> If you have an environment set up so it's easy to test, I would also
> be very interested to know if my patch set[3] that nukes all this
> stuff and includes <stdatomic.h> instead, which is green on
> Windows/x86 CI, will just work™ there too.

I do, and I will as soon as I get this sorted.

> [1] https://en.wikipedia.org/wiki/Windows_11_version_history
> [2] https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro
> [3] https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com

best.

-greg



Okay,

With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
had to update the S_UNLOCK() macro, the compiler did the rest correctly
AFAICT.  So, a much smaller patch (v2) attached. FWIW I'm using Visual
Studio 2026 (18) to build, other platform information below [1].

The white space/formatting issues seem to have been due to my ineptitude
on Windows running pgindent, or maybe I can blame Perl.  I'll try to dig
if I get a minute to figure out what was the cause and if I need to
patch something.

Also attached is a file with some notes on how I build. The build farm
section isn't finished yet, but there is value in the rest for anyone
doing similar work.  If you're wondering how I learned PowerShell, I
didn't.  I used AI, forgive me.  Eventually I'll have a build animal to
add to the "farm" and update a wiki page somewhere when this solidifies
a bit more. :)

best.

-greg

[1] OS Name    Microsoft Windows 11 Pro
Version    10.0.26100 Build 26100
Other OS Description     Not Available
OS Manufacturer    Microsoft Corporation
System Name    SANTORINI
System Manufacturer    Microsoft Corporation
System Model    Windows Dev Kit 2023
System Type    ARM64-based PC
System SKU    2043
Processor    Snapdragon Compute Platform, 2995 Mhz, 8 Core(s), 8 Logical Processor(s)
BIOS Version/Date    Microsoft Corporation 13.42.235, 11/13/2024
Attachments
On Thu, Nov 20, 2025, at 7:08 PM, Thomas Munro wrote:
> If you have an environment set up so it's easy to test, I would also
> be very interested to know if my patch set[3] that nukes all this
> stuff and includes <stdatomic.h> instead, which is green on
> Windows/x86 CI, will just work™ there too.
>
> [3]
> https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com

Hello Thomas,

I gave your stdatomic patches a try, here's where it fell apart linking.

[174/1996] Linking target src/interfaces/libpq/libpq.dll
   Creating library src/interfaces/libpq\libpq.lib
[981/1996] Linking target src/interfaces/ecpg/pgtypeslib/libpgtypes.dll
   Creating library src\interfaces\ecpg\pgtypeslib\libpgtypes.lib
[1235/1996] Linking target src/interfaces/ecpg/ecpglib/libecpg.dll
   Creating library src\interfaces\ecpg\ecpglib\libecpg.lib
[1401/1996] Linking target src/backend/postgres.exe
FAILED: [code=1120] src/backend/postgres.exe src/backend/postgres.pdb
"link" @src/backend/postgres.exe.rsp
   Creating library src\backend\postgres.lib
storage_lmgr_s_lock.c.obj : error LNK2019: unresolved external symbol _mm_pause referenced in function
perform_spin_delay
src\backend\postgres.exe : fatal error LNK1120: 1 unresolved externals
[1410/1996] Compiling C object src/backend/snowball/dict_snowball.dll.p/libstemmer_stem_UTF_8_german.c.obj
ninja: build stopped: subcommand failed.

I used your v2 and the attached reduced set of Dave's and my changes.  I'll work on this next week if I find a bit of
time, shouldn't be all that hard.  I thought I'd update you before then.  Also attached is my config script; I sent out
my notes in a separate email.

I forgot to provide the MSVC compiler/linker versions for posterity, they are:
  cl (msvc 19.50.35718 "Microsoft (R) C/C++ Optimizing Compiler Version 19.50.35718 for ARM64")
  link 14.50.35718.0

best.

-greg

Attachments
Hi,

On 2025-11-22 16:43:30 -0500, Greg Burd wrote:
> With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
> had to update the S_UNLOCK() macro, the compiler did the rest correctly
> AFAICT.

Just to be clear - the flag shouldn't be necessary for things to work
correctly. I was only using it to have godbolt inline the intrinsics, rather
than have them generate an out-of-line function call that I couldn't easily
inspect. I'm fairly sure that the out-of-line functions also have the relevant
barriers.


> @@ -2509,7 +2513,10 @@ int main(void)
>  }
>  '''
>  
> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> +  if cc.get_id() == 'msvc'
> +    cdata.set('USE_ARMV8_CRC32C', 1)
> +    have_optimized_crc = true

Should have a comment explaining why we can do this unconditionally...


Greetings,

Andres Freund



On Mon, Nov 24, 2025 at 4:55 AM Andres Freund <andres@anarazel.de> wrote:
> > -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> > +  if cc.get_id() == 'msvc'
> > +    cdata.set('USE_ARMV8_CRC32C', 1)
> > +    have_optimized_crc = true
>
> Should have a comment explaining why we can do this unconditionally...

I was wondering about that in light of your revelation that it must be
using /arch:armv8.0 by default.  If we only support Windows/ARM on
armv8.1 hardware (like Windows 11 itself), then I think that means
that we'd want /arch:armv8.1 here:

+  # Add ARM64 architecture flag for Windows 11 ARM64 for correct intrinsics
+  if host_machine.system() == 'windows' and host_machine.cpu_family()
== 'aarch64'
+    add_project_arguments('/arch:armv9.4', language: ['c', 'cpp'])
+  endif

We can't impose our own random high ISA requirement like that, or some
machines will choke on illegal instructions.

I wonder if the same applies to Visual Studio on x86.  The OS now
requires x86-64-v2 (like RHEL9, and many other Linux distros except
Debian?), but Visual Studio might not know that... but then we might
say that's an issue for the EDB packaging crew to think about, not us.

In that case we might want to say "OK you choose the ARM version, but
we're not going to write the runtime test for CRC on armv8, we'll do a
compile-time test only, because it would be stupid to waste time
writing code for armv8.0".



On Nov 23 2025, at 10:55 am, Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2025-11-22 16:43:30 -0500, Greg Burd wrote:
>> With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
>> had to update the S_UNLOCK() macro, the compiler did the rest correctly
>> AFAICT.
> 
> Just to be clear - the flag shouldn't be necessary for things to work
> correctly. I was only using it to have godbolt inline the intrinsics, rather
> than have them generate an out-of-line function call that I couldn't easily
> inspect. I'm fairly sure that the out-of-line functions also have the relevant
> barriers.

Well, I learned something new about MSVC (again) today.  Thanks Andres
for pointing that out.  I fiddled with godbolt a bit and indeed on ARM64
systems the only thing that flag (/arch:armv9.4) does is to replace the
out-of-line function call for the intrinsic with the inlined version of it.

Without:
        bl          _InterlockedCompareExchange

With:
        mov         w10,#2
        str         w8,[sp]
        mov         x9,sp
        casal       w8,w10,[x9]

Good to know, and yes it does seem that the ASM includes the correct
instruction sequence.

I think, except for the build issues, the one thing I can put my finger
on as a "fix" is the change from _ReadWriteBarrier() to __dmb(_ARM64_BARRIER_SY).

The docs say:
The _ReadWriteBarrier intrinsic limits the compiler optimizations that
can remove or reorder memory accesses across the point of the call.

Note that it says "compiler optimization" not "system memory barrier". 
So, this macro change:

 #define S_UNLOCK(lock)    \
-    do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
+    do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
 
Makes more sense.  This change replaces a compiler-only barrier with a
full system memory barrier, fundamentally strengthening memory ordering
guarantees for spinlock release on ARM64.

When I remove the compiler flag from the patch and keep the S_UNLOCK()
change the tests pass.  I think we've found the critical fix.  I'll
update the patch set.

>> @@ -2509,7 +2513,10 @@ int main(void)
>>  }
>>  '''
>>  
>> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
>> __crc32cd without -march=armv8-a+crc',
>> +  if cc.get_id() == 'msvc'
>> +    cdata.set('USE_ARMV8_CRC32C', 1)
>> +    have_optimized_crc = true
> 
> Should have a comment explaining why we can do this unconditionally...

Sure, the more I look at this the more I want to clean it up a bit more.

> Greetings,
> 
> Andres Freund

Thanks again for the helpful nudge Andres, best.

-greg



On Nov 23 2025, at 3:32 pm, Thomas Munro <thomas.munro@gmail.com> wrote:

> On Mon, Nov 24, 2025 at 4:55 AM Andres Freund <andres@anarazel.de> wrote:
>> > -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
>> __crc32cd without -march=armv8-a+crc',
>> > +  if cc.get_id() == 'msvc'
>> > +    cdata.set('USE_ARMV8_CRC32C', 1)
>> > +    have_optimized_crc = true
>>
>> Should have a comment explaining why we can do this unconditionally...
>
> I was wondering about that in light of your revelation that it must be
> using /arch:armv8.0 by default.  If we only support Windows/ARM on
> armv8.1 hardware (like Windows 11 itself), then I think that means
> that we'd want /arch:armv8.1 here:
>
> +  # Add ARM64 architecture flag for Windows 11 ARM64 for correct intrinsics
> +  if host_machine.system() == 'windows' and host_machine.cpu_family()
> == 'aarch64'
> +    add_project_arguments('/arch:armv9.4', language: ['c', 'cpp'])
> +  endif

Hey Thomas,

Turns out that flag isn't required at all, the issue was the S_UNLOCK()
implementation's use of a compiler barrier not a CPU memory barrier.

> We can't impose our own random high ISA requirement like that, or some
> machines will choke on illegal instructions.

New incoming patch drops this entirely, so no need to worry about it.

> I wonder if the same applies to Visual Studio on x86.  The OS now
> requires x86-64-v2 (like RHEL9, and many other Linux distros except
> Debian?), but Visual Studio might not know that... but then we might
> say that's an issue for the EDB packaging crew to think about, not us.
>
> In that case we might want to say "OK you choose the ARM version, but
> we're not going to write the runtime test for CRC on armv8, we'll do a
> compile-time test only, because it would be stupid to waste time
> writing code for armv8.0".

best.

-greg



Hello,

v3 attached is a more concise single patch that adds support for MSVC on Win11 ARM64.  The issues fixed are mostly
related to config, build, docs, an implementation of spin_delay(), and a change to the S_UNLOCK() macro.  I've squished
the work we did into a single commit and improved the commit message.  Thomas, I think in another thread you mentioned
that some of what Dave added came from you, correct?  We should add you as another author if so, right?

best.

-greg

Attachments
On Mon, Nov 24, 2025, at 10:04 AM, Greg Burd wrote:
> Hello,
>
> v3 attached is a more concise single patch that adds support for MSVC 
> on Win11 ARM64.  The issues fixed are mostly related to config, build, 
> docs, an implementation of spin_delay() and a change to S_UNLOCK() 
> macro.  I've squished the work we did into a single commit and improved 
> the commit message.  Thomas, I think in another thread you mentioned 
> that some of what Dave added came from you, correct?  We should add you 
> as another author if so, right?
>
> best.
>
> -greg
>
> Attachments:
> * v3-0001-Enable-PostgreSQL-build-on-Windows-11-ARM64-with-.patch

I forgot to account for the non-ARM64 implementation of S_UNLOCK(), thanks CI for pointing that out.

-greg
Attachments
Hi,

On 2025-11-24 11:28:28 -0500, Greg Burd wrote:
> @@ -2509,25 +2513,64 @@ int main(void)
>  }
>  '''
>  
> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> -      args: test_c_args)
> -    # Use ARM CRC Extension unconditionally
> -    cdata.set('USE_ARMV8_CRC32C', 1)
> -    have_optimized_crc = true
> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
> -      args: test_c_args + ['-march=armv8-a+crc+simd'])
> -    # Use ARM CRC Extension, with runtime check
> -    cflags_crc += '-march=armv8-a+crc+simd'
> -    cdata.set('USE_ARMV8_CRC32C', false)
> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> -    have_optimized_crc = true
> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
> -      args: test_c_args + ['-march=armv8-a+crc'])
> -    # Use ARM CRC Extension, with runtime check
> -    cflags_crc += '-march=armv8-a+crc'
> -    cdata.set('USE_ARMV8_CRC32C', false)
> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> -    have_optimized_crc = true
> +  if cc.get_id() == 'msvc'
> +    # MSVC: Intrinsic availability check for ARM64
> +    if host_machine.cpu_family() == 'aarch64'
> +      # Test if CRC32C intrinsics are available in intrin.h
> +      crc32c_test_msvc = '''
> +        #include <intrin.h>
> +        int main(void) {
> +          uint32_t crc = 0;
> +          uint8_t data = 0;
> +          crc = __crc32cb(crc, data);
> +          return 0;
> +        }
> +      '''
> +      if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
> +        cdata.set('USE_ARMV8_CRC32C', 1)
> +        have_optimized_crc = true
> +        message('Using ARM64 CRC32C hardware acceleration (MSVC)')
> +      else
> +        message('CRC32C intrinsics not available on this MSVC ARM64 build')
> +      endif

Does this:
a) need to be conditional at all, given that it's msvc specific, it seems we
   don't need to run a test?
b) why is the msvc block outside of the general aarch64 block but then has
another nested aarch64 test inside? That seems unnecessarily complicated and
requires reindenting unnecessarily much code?


> +/*
> + * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
> + */
> +#ifdef _M_ARM64
> +
> +static __forceinline void
> +spin_delay(void)
> +{
> +     /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
> +    __isb(_ARM64_BARRIER_SY);
> +}
> +#else
> +/*
> + * For x64, use _mm_pause intrinsic instead of rep nop.
> + */
>  static __forceinline void
>  spin_delay(void)
>  {
>      _mm_pause();
>  }

This continues to use a barrier, with a reference to a list of barrier
semantics that really don't seem to make a whole lot of sense in the context
of spin_delay(). If we want to emit this kind of barrier for now it's ok with
me, but it should be documented as just being a fairly random choice, rather
than a link that doesn't explain anything.


> +#endif
>  #else
>  static __forceinline void
>  spin_delay(void)
> @@ -623,9 +640,13 @@ spin_delay(void)
>  #include <intrin.h>
>  #pragma intrinsic(_ReadWriteBarrier)
>  
> -#define S_UNLOCK(lock)    \
> +#ifdef _M_ARM64
> +#define S_UNLOCK(lock) \
> +    do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
> +#else

This doesn't seem like the right way to implement this - why not use
InterlockedExchange(lock, 0)? That will do the write with barrier semantics.

Greetings,

Andres Freund



On Mon, Nov 24, 2025, at 6:20 PM, Andres Freund wrote:
> Hi,

Thanks again for taking a look at the patch, hopefully I got it right this time. :)

> On 2025-11-24 11:28:28 -0500, Greg Burd wrote:
>> @@ -2509,25 +2513,64 @@ int main(void)
>>  }
>>  '''
>>  
>> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
>> -      args: test_c_args)
>> -    # Use ARM CRC Extension unconditionally
>> -    cdata.set('USE_ARMV8_CRC32C', 1)
>> -    have_optimized_crc = true
>> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
>> -      args: test_c_args + ['-march=armv8-a+crc+simd'])
>> -    # Use ARM CRC Extension, with runtime check
>> -    cflags_crc += '-march=armv8-a+crc+simd'
>> -    cdata.set('USE_ARMV8_CRC32C', false)
>> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>> -    have_optimized_crc = true
>> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
>> -      args: test_c_args + ['-march=armv8-a+crc'])
>> -    # Use ARM CRC Extension, with runtime check
>> -    cflags_crc += '-march=armv8-a+crc'
>> -    cdata.set('USE_ARMV8_CRC32C', false)
>> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>> -    have_optimized_crc = true
>> +  if cc.get_id() == 'msvc'
>> +    # MSVC: Intrinsic availability check for ARM64
>> +    if host_machine.cpu_family() == 'aarch64'
>> +      # Test if CRC32C intrinsics are available in intrin.h
>> +      crc32c_test_msvc = '''
>> +        #include <intrin.h>
>> +        int main(void) {
>> +          uint32_t crc = 0;
>> +          uint8_t data = 0;
>> +          crc = __crc32cb(crc, data);
>> +          return 0;
>> +        }
>> +      '''
>> +      if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
>> +        cdata.set('USE_ARMV8_CRC32C', 1)
>> +        have_optimized_crc = true
>> +        message('Using ARM64 CRC32C hardware acceleration (MSVC)')
>> +      else
>> +        message('CRC32C intrinsics not available on this MSVC ARM64 build')
>> +      endif
>
> Does this:
> a) need to be conditional at all, given that it's msvc specific, it seems we
>    don't need to run a test?
> b) why is the msvc block outside of the general aarch64 block but then has
> another nested aarch64 test inside? That seems unnecessarily complicated and
> requires reindenting unnecessarily much code?

Yep, I rushed this.  Apologies.  I've re-worked it with your suggestions.

>> +/*
>> + * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
>> + */
>> +#ifdef _M_ARM64
>> +
>> +static __forceinline void
>> +spin_delay(void)
>> +{
>> +     /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
>> +    __isb(_ARM64_BARRIER_SY);
>> +}
>> +#else
>> +/*
>> + * For x64, use _mm_pause intrinsic instead of rep nop.
>> + */
>>  static __forceinline void
>>  spin_delay(void)
>>  {
>>      _mm_pause();
>>  }
>
> This continues to use a barrier, with a reference to a list of barrier
> semantics that really don't seem to make a whole lot of sense in the context
> of spin_delay(). If we want to emit this kind of barrier for now it's ok with
> me, but it should be documented as just being a fairly random choice, rather
> than a link that doesn't explain anything.

I did more digging and found that you were right about the use of ISB for spin_delay().  I think I was misled by
earlier code in that file (lines 277-286) where there is an implementation of spin_delay() that uses ISB; I ran with
that, not doing enough research myself.  So I did more digging and found an article on this [1], and it seems that
YIELD should be used, not ISB.  I checked into how others implement this feature: Java [2][3] uses YIELD, and Linux
[4][5] uses YIELD in cpu_relax(), called by __delay().
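
For reference, the two candidates on MSVC/ARM64 would look roughly like
this (a sketch; both intrinsics come from intrin.h):

#include <intrin.h>

/* Option A: yield hint, as used by Linux's cpu_relax() and OpenJDK */
static __forceinline void
spin_delay_yield(void)
{
  __yield();
}

/* Option B: instruction synchronization barrier */
static __forceinline void
spin_delay_isb(void)
{
  __isb(_ARM64_BARRIER_SY);
}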
 

>> +#endif
>>  #else
>>  static __forceinline void
>>  spin_delay(void)
>> @@ -623,9 +640,13 @@ spin_delay(void)
>>  #include <intrin.h>
>>  #pragma intrinsic(_ReadWriteBarrier)
>>  
>> -#define S_UNLOCK(lock)    \
>> +#ifdef _M_ARM64
>> +#define S_UNLOCK(lock) \
>> +    do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
>> +#else
>
> This doesn't seem like the right way to implement this - why not use
> InterlockedExchange(lock, 0)? That will do the write with barrier semantics.

Great idea, done.  Seems to work too.

> Greetings,
>
> Andres Freund

Given what I learned about YIELD vs ISB for spin delay it seems like a reasonable idea to submit a new patch for the
non-MSVC path and switch it to YIELD, what do you think?
 

v5 attached, best.

-greg

[1]
https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm

[2] https://cr.openjdk.org/~dchuyko/8186670/yield/spinwait.html
[3] https://mail.openjdk.org/pipermail/aarch64-port-dev/2017-August/004880.html
[4]
https://github.com/torvalds/linux/blob/ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d/arch/arm64/include/asm/vdso/processor.h
[5] https://github.com/torvalds/linux/commit/f511e079177a9b97175a9a3b0ee2374d55682403
Attachments
Hello.

Rebased with only minor changes to meson.build, this patch is ready for review/commit as it is passing tests on my
aarch64 Win11 MSVC system.  Also note that this system I'm testing on is ready to become a member of the buildfarm
(application submitted) and monitor this combo in perpetuity.
 

FWIW, Dave's bug report to Microsoft [1] seems to be a duplicate of [2] and the duplicate is about to be auto-closed.

best.

-greg

[1] https://developercommunity.visualstudio.com/t/MSVCs-_InterlockedCompareExchange-doe/11004239
[2] https://developercommunity.visualstudio.com/t/MSVC-compiler-optimization-in-PostgreSQL/10982137
Attachments
On Thu, Dec 11, 2025 at 5:32 AM Greg Burd <greg@burd.me> wrote:
> Rebased with only minor changes to meson.build, this patch is ready for review/commit as it is passing tests on my
aarch64 Win11 MSVC system.  Also note that this system I'm testing on is ready to become a member of the buildfarm
(application submitted) and monitor this combo in perpetuity.

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
__crc32cd without -march=armv8-a+crc',
...
+  if host_machine.cpu_family() == 'aarch64'

I think this new nesting of the CRC32 feature tests breaks the test on
"armv7" distros (in our build farm, that's a bunch of RPis running
Debian/Raspbian, but at least FreeBSD and NetBSD also support
"armv7").  Any ARM chip made since around 2011 is really an ARMv8+
chip running Aarch32 code and can thus reach the ARMv8 instructions.
For example "grison" says:

checking build system type... (cached) armv7l-unknown-linux-gnueabihf
...
checking which CRC-32C implementation to use... ARMv8 CRC instructions
with runtime check

-#define S_UNLOCK(lock)    \
+#define S_UNLOCK(lock) \

Bogus whitespace change.



On Wed, Dec 10, 2025, at 4:31 PM, Thomas Munro wrote:
> On Thu, Dec 11, 2025 at 5:32 AM Greg Burd <greg@burd.me> wrote:
>> Rebased with only minor changes to meson.build, this patch is ready for review/commit as it is passing tests on my
aarch64 Win11 MSVC system.  Also note that this system I'm testing on is ready to become a member of the buildfarm
(application submitted) and monitor this combo in perpetuity.
>
> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
> __crc32cd without -march=armv8-a+crc',
> ...
> +  if host_machine.cpu_family() == 'aarch64'

Thanks for taking the time to review.

> I think this new nesting of the CRC32 feature tests breaks the test on
> "armv7" distros (in our build farm, that's a bunch of RPis running
> Debian/Raspbian, but at least FreeBSD and NetBSD also support
> "armv7").  Any ARM chip made since around 2011 is really an ARMv8+
> chip running Aarch32 code and can thus reach the ARMv8 instructions.
> For example "grison" says:
>
> checking build system type... (cached) armv7l-unknown-linux-gnueabihf
> ...
> checking which CRC-32C implementation to use... ARMv8 CRC instructions
> with runtime check

That was rather silly of me to overlook, apologies.  I went back to the code that was there before, written by Andres
I believe, and simply added a test for this one special case.  That reduces code churn and simplifies the logic while
preserving the earlier behavior.

As I was re-reading I decided to review the YIELD vs ISB question for this platform combo again in light of a82a5ee
[1].  What I found was a wealth of information and work on the topic, and the thread mentioned MariaDB, so I did some
digging (because, why not?).  Go-lang [2] uses ISB, but a comment there led me on to another resource, a blog post from
ARM on the topic of spin locks for ARM64 [3].  I found that like Go-lang, Rust [4] uses ISB (and if you read only one
thread on the topic, this is the one to dig into).  The developers on that thread wrote a number of benchmarks and
tested a variety of approaches, landing on "ISB SY" on aarch64.  I found that MySQL [5][6] also went with ISB, as have
Cloudera [7], Impala [8], Aerospike [9], and WebAssembly [10], at which point I stopped looking around.  There were
other resources out there for understanding ARM and relaxed memory models, such as [11], [12], and [13], for those who
care to learn.

All of this is to say that I swapped out the YIELD for "ISB SY" on ARM64/MSVC because that seems to make the most
sense from the evidence I found, and for consistency in our code.

> -#define S_UNLOCK(lock)    \
> +#define S_UNLOCK(lock) \
>
> Bogus whitespace change.

Fixed, thanks.

best.

-greg

[1] https://postgr.es/m/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com
[2] https://github.com/golang/go/issues/69232
[3]
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
[4] https://github.com/rust-lang/rust/commit/c064b6560b7ce0adeb9bbf5d7dcf12b1acb0c807
[5] https://bugs.mysql.com/bug.php?id=100664
[6] https://github.com/mysql/mysql-server/pull/305/files
[7] https://gerrit.cloudera.org/#/c/19865/
[8] https://issues.apache.org/jira/browse/IMPALA-12122
[9] https://github.com/aerospike/aerospike-common/pull/17
[10] https://github.com/WebAssembly/threads/issues/15
[11] https://developer.arm.com/documentation/genc007826/latest "Barrier Litmus Tests and Cookbook"
[12] http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf - Memory Barriers: a Hardware View for
Software Hackers
[13] https://randomascii.wordpress.com/2020/11/29/arm-and-lock-free-programming/

Attachments
Well, TIL something about Meson; you must escape multi-line statements with a backslash.  Apologies for the
oversight/noise.

best.

-greg

Attachments
+/*
+ * _InterlockedExchange() generates a full memory barrier (or release
+ * semantics) that ensures all prior memory operations are visible to
+ * other cores before the lock is released.
+ */
+#define S_UNLOCK(lock) (InterlockedExchange(lock, 0))

This seems to change the implementation from

    #define S_UNLOCK(lock)    \
        do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)

in some cases, but I am insufficiently caffeinated to figure out what
platforms use which implementation.  In any case, it looks like we are
changing it for some currently-supported platforms, and I'm curious why.
Perhaps there's some way to make the #ifdefs a bit more readable, too
(e.g., a prerequisite patch that rearranges things).

The rest looks generally reasonable to me.

-- 
nathan



On Fri, Dec 12, 2025, at 11:03 AM, Nathan Bossart wrote:
> +/*
> + * _InterlockedExchange() generates a full memory barrier (or release
> + * semantics) that ensures all prior memory operations are visible to
> + * other cores before the lock is released.
> + */
> +#define S_UNLOCK(lock) (InterlockedExchange(lock, 0))

Nathan, thanks for looking at the patch!

> This seems to change the implementation from
>
>     #define S_UNLOCK(lock)    \
>         do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
>
> in some cases, but I am insufficiently caffeinated to figure out what
> platforms use which implementation.  In any case, it looks like we are
> changing it for some currently-supported platforms, and I'm curious why.

This change is within _MSC_VER, but AFAICT this intrinsic is available across their supported platforms.  The previous
implementation of S_UNLOCK() boils down to a no-op because the _ReadWriteBarrier()[1] is a hint to the compiler and
does not emit any instruction on any platform, and it's also deprecated.  So, on MSVC S_UNLOCK is an unguarded
assignment and then a loop that will be optimized out, not really what we wanted I'd imagine.  My tests with godbolt
showed this to be true, no instruction barriers emitted.  I think it was Andres [2] who suggested replacing it with
_InterlockedExchange() [3].  So, given that _ReadWriteBarrier() is deprecated, I decided not to specialize this change
to only the ARM64 platform; sorry for not making that clear in the commit or email.

My buildfarm animal for this platform was just approved and is named "unicorn", I'm not making that up. :)

best.

-greg

> Perhaps there's some way to make the #ifdefs a bit more readable, too
> (e.g., a prerequisite patch that rearranges things).
>
> The rest looks generally reasonable to me.
>
> -- 
> nathan

[1] https://learn.microsoft.com/en-us/cpp/intrinsics/readwritebarrier
[2] https://www.postgresql.org/message-id/beirrgqo5n5e73dwa4dsdnlbtef3bsdv5sgarm6przdzxvifk5%40whyuhyemmhyr
[3] https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchange-intrinsic-functions



Hi,

On 2025-12-12 14:21:47 -0500, Greg Burd wrote:
> 
> On Fri, Dec 12, 2025, at 11:03 AM, Nathan Bossart wrote:
> > +/*
> > + * _InterlockedExchange() generates a full memory barrier (or release
> > + * semantics) that ensures all prior memory operations are visible to
> > + * other cores before the lock is released.
> > + */
> > +#define S_UNLOCK(lock) (InterlockedExchange(lock, 0))
> 
> Nathan, thanks for looking at the patch!
> 
> > This seems to change the implementation from
> >
> >     #define S_UNLOCK(lock)    \
> >         do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
> >
> > in some cases, but I am insufficiently caffeinated to figure out what
> > platforms use which implementation.  In any case, it looks like we are
> > changing it for some currently-supported platforms, and I'm curious why.
> 
> This change is within _MSC_VER, but AFAICT this intrinsic is available
> across their supported platforms.  The previous implementation of S_UNLOCK()
> boils down to a no-op because the _ReadWriteBarrier()[1] is a hint to the
> compiler and does not emit any instruction on any platform and it's also
> deprecated.  So, on MSVC S_UNLOCK is an unguarded assignment and then a loop
> that will be optimized out, not really what we wanted I'd imagine.

I don't think it can be optimized out, that should be prevented by
_ReadWriteBarrier() being a compiler barrier.


> My tests with godbolt showed this to be true, no instruction barriers
> emitted.  I think it was Andres[2] who suggested replacing it with
> _InterlockedExchange()[3].  So, given that _ReadWriteBarrier() is deprecated
> I decided not to specialize this change to only the ARM64 platform, sorry
> for not making that clear in the commit or email.

I don't think that's a good idea - the _ReadWriteBarrier() is sufficient on
x86 to implement a spinlock release (due to x86 being a total store order
architecture, once the lock is observed as being released, all the effects
protected by the lock are also guaranteed to be visible).  Making the
spinlocks use an atomic op for both acquire and release does cause measurable
slowdowns on x86 with gcc, so I'd expect the same to be true on windows.

Greetings,

Andres Freund



On Fri, Dec 12, 2025, at 2:32 PM, Andres Freund wrote:
> Hi,
>
> On 2025-12-12 14:21:47 -0500, Greg Burd wrote:
>> 
>> On Fri, Dec 12, 2025, at 11:03 AM, Nathan Bossart wrote:
>> > +/*
>> > + * _InterlockedExchange() generates a full memory barrier (or release
>> > + * semantics that ensures all prior memory operations are visible to
>> > + * other cores before the lock is released.
>> > + */
>> > +#define S_UNLOCK(lock) (InterlockedExchange(lock, 0))
>> 
>> Nathan, thanks for looking at the patch!
>> 
>> > This seems to change the implementation from
>> >
>> >     #define S_UNLOCK(lock)    \
>> >         do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
>> >
>> > in some cases, but I am insufficiently caffeinated to figure out what
>> > platforms use which implementation.  In any case, it looks like we are
>> > changing it for some currently-supported platforms, and I'm curious why.
>> 
>> This change is within _MSC_VER, but AFAICT this intrinsic is available
>> across their supported platforms.  The previous implementation of S_UNLOCK()
>> boils down to a no-op because the _ReadWriteBarrier()[1] is a hint to the
>> compiler and does not emit any instruction on any platform and it's also
>> deprecated.  So, on MSVC S_UNLOCK is an unguarded assignment and then a loop
>> that will be optimized out, not really what we wanted I'd imagine.

Thanks Andres for the comments.

> I don't think it can be optimized out, that should be prevented by
> _ReadWriteBarrier() being a compiler barrier.

While the documentation does mention that this has been deprecated, I tend to agree with you.  This has been in place
for a while, no need to change what works at this time.  A new thread might be a better forum if there is some evidence
that we should revisit this in the future.  I'd like to land the ARM64/MSVC changes and enable that platform in this
thread.

>> My tests with godbolt showed this to be true, no instruction barriers
>> emitted.  I think it was Andres[2] who suggested replacing it with
>> _InterlockedExchange()[3].  So, given that _ReadWriteBarrier() is deprecated
>> I decided not to specialize this change to only the ARM64 platform, sorry
>> for not making that clear in the commit or email.
>
> I don't think that's a good idea - the _ReadWriteBarrier() is sufficient on
> x86 to implement a spinlock release (due to x86 being a total store order
> architecture, once the lock is observed as being released, all the effects
> protected by the lock are also guaranteed to be visible).  Making the
> spinlocks use an atomic op for both acquire and release does cause measurable
> slowdowns on x86 with gcc, so I'd expect the same to be true on windows.

Got it, fixed in v9.

best.

-greg

> Greetings,
>
> Andres Freund
Attachments
On Mon, Dec 15, 2025 at 12:27:25PM -0500, Greg Burd wrote:
> Got it, fixed in v9.

I tried to rearrange the s_lock.h changes to make it more obvious what is
specific to AArch64.  WDYT?

-- 
nathan

Attachments
Hi,

On 2025-12-15 15:38:49 -0600, Nathan Bossart wrote:
> +++ b/meson.build
> @@ -2523,7 +2523,8 @@ int main(void)
>  }
>  '''
>  
> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> +  if (host_cpu == 'aarch64' and cc.get_id() == 'msvc') or \
> +        cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
>        args: test_c_args)
>      # Use ARM CRC Extension unconditionally
>      cdata.set('USE_ARMV8_CRC32C', 1)

I still think this should have a comment explaining that we can
unconditionally rely on crc32 support due to Windows' baseline requirements.


> diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
> index 7f8f566bd40..e62141abf0a 100644
> --- a/src/include/storage/s_lock.h
> +++ b/src/include/storage/s_lock.h
> @@ -602,13 +602,21 @@ typedef LONG slock_t;
>  
>  #define SPIN_DELAY() spin_delay()
>  
> -/* If using Visual C++ on Win64, inline assembly is unavailable.
> - * Use a _mm_pause intrinsic instead of rep nop.
> - */
> -#if defined(_WIN64)
> +#ifdef _M_ARM64
> +static __forceinline void
> +spin_delay(void)
> +{
> +    /* Research indicates ISB is better than __yield() on AArch64. */
> +    __isb(_ARM64_BARRIER_SY);

It'd be good to link the research in some form or another. Otherwise it's
harder to evolve the code in the future, because we don't know if the research
was "I liked the color better" or "one is catastrophically slower than the
other".

> +}
> +#elif defined(_WIN64)
>  static __forceinline void
>  spin_delay(void)
>  {
> +    /*
> +     * If using Visual C++ on Win64, inline assembly is unavailable.
> +     * Use a _mm_pause intrinsic instead of rep nop.
> +     */
>      _mm_pause();
>  }
>  #else
> @@ -621,12 +629,19 @@ spin_delay(void)
>  #endif
>  
>  #include <intrin.h>
> +#ifdef _M_ARM64
> +#pragma intrinsic(_InterlockedExchange)
> +
> +/* _ReadWriteBarrier() is insufficient on non-TSO architectures. */
> +#define S_UNLOCK(lock) _InterlockedExchange(lock, 0)
> +#else
>  #pragma intrinsic(_ReadWriteBarrier)
>  
>  #define S_UNLOCK(lock)    \
>      do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
>  
>  #endif
> +#endif

The newline placement looks odd here. I'd add newlines around the #else.



Greetings,

Andres Freund



On Mon, Dec 15, 2025 at 05:32:36PM -0500, Andres Freund wrote:
> On 2025-12-15 15:38:49 -0600, Nathan Bossart wrote:
>> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
>> +  if (host_cpu == 'aarch64' and cc.get_id() == 'msvc') or \
>> +        cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
>>        args: test_c_args)
>>      # Use ARM CRC Extension unconditionally
>>      cdata.set('USE_ARMV8_CRC32C', 1)
> 
> I still think this should have a comment explaining that we can
> unconditionally rely on crc32 support due to Windows' baseline requirements.

Done.

>> +#ifdef _M_ARM64
>> +static __forceinline void
>> +spin_delay(void)
>> +{
>> +    /* Research indicates ISB is better than __yield() on AArch64. */
>> +    __isb(_ARM64_BARRIER_SY);
> 
> It'd be good to link the research in some form or another. Otherwise it's
> harder to evolve the code in the future, because we don't know if the research
> was "I liked the color better" or "one is catastrophically slower than the
> other".

Done.

>>  #include <intrin.h>
>> +#ifdef _M_ARM64
>> +#pragma intrinsic(_InterlockedExchange)
>> +
>> +/* _ReadWriteBarrier() is insufficient on non-TSO architectures. */
>> +#define S_UNLOCK(lock) _InterlockedExchange(lock, 0)
>> +#else
>>  #pragma intrinsic(_ReadWriteBarrier)
>>  
>>  #define S_UNLOCK(lock)    \
>>      do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
>>  
>>  #endif
>> +#endif
> 
> The newline placement looks odd here. I'd add newlines around the #else.

Done.

-- 
nathan

Attachments

On December 15, 2025 6:00:47 PM EST, Nathan Bossart <nathandbossart@gmail.com> wrote:
>Done.

LGTM
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.