Discussion: Popcount optimization for the slow-path lookups
I want to propose an optimization for pg_popcount32_slow() and pg_popcount64_slow() where the lookups into pg_number_of_ones[] are made branchless. On x86 it shows a speedup of around 58% for uint64 and 35% for uint32 words compared to the current looped version. The gain is even larger on Arm64 (Apple M1): ~2.6x for uint64 and ~2.25x for uint32 words.
I probably have to guard the lookup in pg_popcount64_slow() with `#if SIZEOF_LONG == 8`, like it's done for __builtin_popcountl() in the same function. What do you think?
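For reference, this is roughly the shape of the branchless lookup I have in mind (a sketch rather than the exact patch; the function name is a placeholder, and it assumes pg_number_of_ones[] stays the existing 256-entry per-byte table):
```
/*
 * Sketch only: one table lookup per byte, summed with no loop and no branch.
 */
static inline int
pg_popcount64_branchless(uint64 word)
{
    return pg_number_of_ones[word & 0xFF] +
        pg_number_of_ones[(word >> 8) & 0xFF] +
        pg_number_of_ones[(word >> 16) & 0xFF] +
        pg_number_of_ones[(word >> 24) & 0xFF] +
        pg_number_of_ones[(word >> 32) & 0xFF] +
        pg_number_of_ones[(word >> 40) & 0xFF] +
        pg_number_of_ones[(word >> 48) & 0xFF] +
        pg_number_of_ones[(word >> 56) & 0xFF];
}
```
The uint32 variant is the same with the top four lookups dropped.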
For the benchmarks, I used functions containing copies of the lookup loop from pg_popcount32/64_slow() and measured them against their branchless counterparts. In C++, I pre-generated two static vectors of random uint64s and uint32s, 5M elements each, and ran the lookups using google/benchmark [0]. I ran it on M1 and x86 Macs.
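To make the setup concrete, here is a minimal sketch of the harness for the looped "PG" variant (illustrative only, not the exact program; the table stand-in, names, and the fixed seed are mine):
```
#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

static uint8_t number_of_ones[256];     // stand-in for pg_number_of_ones[]

static std::vector<uint64_t> make_words()
{
    // fill the per-byte popcount table once
    for (int i = 0; i < 256; i++)
        number_of_ones[i] = (i & 1) + number_of_ones[i >> 1];

    // ~5M pre-generated random 64-bit words
    std::mt19937_64 gen(42);
    std::vector<uint64_t> v(5000000);
    for (auto &w : v)
        w = gen();
    return v;
}

static int popcount64_loop(uint64_t word)
{
    int result = 0;
    while (word != 0)                   // same shape as pg_popcount64_slow()
    {
        result += number_of_ones[word & 255];
        word >>= 8;
    }
    return result;
}

static void Popcnt64_PG(benchmark::State &state)
{
    static const std::vector<uint64_t> words = make_words();
    std::size_t i = 0;
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(popcount64_loop(words[i]));
        if (++i == words.size())
            i = 0;
    }
}
BENCHMARK(Popcnt64_PG);

BENCHMARK_MAIN();
```
The other variants are registered the same way, one BENCHMARK() per function.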
Results:
x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 17.2 ns 17.2 ns 39247568
Popcnt64_PG 27.2 ns 27.0 ns 25415636
Popcnt32_BranchlessLookup 14.5 ns 14.4 ns 48441566
Popcnt32_PG 19.5 ns 19.4 ns 36125676
```
Apple M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 6.52 ns 6.52 ns 113974186
Popcnt64_PG 17.1 ns 17.1 ns 43015334
Popcnt32_BranchlessLookup 4.34 ns 4.34 ns 145674483
Popcnt32_PG 9.79 ns 9.79 ns 75893374
```
I also tried parallel summing of adjacent bits (see Hacker's Delight [1], Chapter 5).
This is the code for uint64:
```
#define POPCOUNT_M0 0x5555555555555555 // 01010101 ...
#define POPCOUNT_M1 0x3333333333333333 // 00110011 ...
#define POPCOUNT_M2 0x0F0F0F0F0F0F0F0F // 00001111 ...

static inline int
pg_popcount_word64(uint64 word)
{
    word = word - ((word >> 1) & POPCOUNT_M0);
    word = ((word >> 2) & POPCOUNT_M1) + (word & POPCOUNT_M1);
    word = ((word >> 4) + word) & POPCOUNT_M2;
    word += word >> 8;
    word += word >> 16;
    word += word >> 32; // this step is omitted for the uint32 version

    return word & 0x3F;
}
```
However, it doesn't show any improvement over the branchless lookup on M1, and only a slight gain for uint64 on x86.
The same tests, with adjacent-bit summing included:
x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 17.2 ns 17.2 ns 39247568
Popcnt64_AdjacentBits 16.8 ns 16.7 ns 41481236
Popcnt64_PG 27.2 ns 27.0 ns 25415636
Popcnt32_BranchlessLookup 14.5 ns 14.4 ns 48441566
Popcnt32_AdjacentBits 15.5 ns 15.4 ns 44454323
Popcnt32_PG 19.5 ns 19.4 ns 36125676
```
M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 6.52 ns 6.52 ns 113974186
Popcnt64_AdjacentBits 8.05 ns 8.05 ns 86988490
Popcnt64_PG 17.1 ns 17.1 ns 43015334
Popcnt32_BranchlessLookup 4.34 ns 4.34 ns 145674483
Popcnt32_AdjacentBits 7.27 ns 7.27 ns 96050714
Popcnt32_PG 9.79 ns 9.79 ns 75893374
```
[0] https://github.com/google/benchmark
[1] https://dl.acm.org/doi/book/10.5555/2462741
-----
Cheers,
Andy
Hi Andy!
On 05.12.2025 14:07, Andrew Pogrebnoi wrote:
> [...]
I would like to test if I can reproduce your results. Could you share
your test program?
You also don't specify an optimization level, which means the default
level -O0 is used. Does it make a difference if you run with e.g. -O3?
--
David Geier
> Could you share your test program?
Here you go: https://gist.github.com/dAdAbird/1480ff15764f5a6301174806d8512a3a
> You also don't specify an optimization level. That means the default
> level -O0 is used. Does it make it a difference if you run e.g. with -O3?
x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T16:22:43+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 2.67, 4.72, 2.90
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 5.92 ns 5.90 ns 113871130
Popcnt64_AdjacentBits 5.30 ns 5.27 ns 115084258
Popcnt64_PG 8.34 ns 8.32 ns 80622869
Popcnt32_BranchlessLookup 3.62 ns 3.61 ns 197816110
Popcnt32_AdjacentBits 4.56 ns 4.55 ns 154932383
Popcnt32_PG 5.32 ns 5.31 ns 121845083
```
Apple M1
```
2025-12-05T14:29:18+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.01, 0.01, 0.00
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 2.13 ns 2.13 ns 342101937
Popcnt64_AdjacentBits 1.89 ns 1.89 ns 370316571
Popcnt64_PG 3.83 ns 3.83 ns 190990088
Popcnt32_BranchlessLookup 1.19 ns 1.19 ns 591797267
Popcnt32_AdjacentBits 1.73 ns 1.73 ns 409381266
Popcnt32_PG 2.01 ns 2.01 ns 344969625
```
On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote:
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. [...]

I don't think the proposed improvements are relevant for either of the
machines you used for your benchmarks. For x86, we've optimized our
popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
to use Neon or SVE. And for other systems, we still try to use
__builtin_popcount() and friends in the fallback paths, which IIUC are
available on both gcc and clang (and maybe elsewhere). IMHO we need to run
the benchmarks on a compiler/architecture combination where it would
actually be used in practice.

--
nathan
> I don't think the proposed improvements are relevant for either of the
> machines you used for your benchmarks. For x86, we've optimized our
> popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
> to use Neon or SVE. And for other systems, we still try to use
> __builtin_popcount() and friends in the fallback paths, which IIUC are
> available on both gcc and clang (and maybe elsewhere). IMHO we need to run
> the benchmarks on a compiler/architecture combination where it would
> actually be used in practice.
I reasoned that the code path still exists, so eliminating the branching there would most probably still be beneficial. But you are right, we need to test it on the target architectures/compilers. I'll try to do that.
On Fri, Dec 5, 2025 at 10:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I don't think the proposed improvements are relevant for either of the
> machines you used for your benchmarks. [...] IMHO we need to run
> the benchmarks on a compiler/architecture combination where it would
> actually be used in practice.

Yeah, if we did anything here, I'd rather arrange so that architectures
that have unconditional hardware support can inline it at compile time. I
believe ppc64le and aarch64 can do that unconditionally. For x86 we might
be able to detect some symbol defined by the compiler, to do the same thing
for OS's that require such support.

--
John Naylor
Amazon Web Services
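A rough illustration of the kind of compile-time check being suggested (a sketch only, not existing PostgreSQL code; the fallback function name is hypothetical, and it assumes gcc/clang define __POPCNT__ whenever the target is guaranteed to have the x86 POPCNT instruction):
```
#include <stdint.h>

/*
 * Inline the builtin where hardware popcount is unconditional
 * (AArch64 has CNT in the base ISA; ppc64le could be added similarly);
 * otherwise fall back to a runtime-dispatched function.
 */
#if defined(__POPCNT__) || defined(__aarch64__)
static inline int
popcount64_inline(uint64_t word)
{
    return __builtin_popcountll(word);
}
#else
extern int popcount64_dispatch(uint64_t word);  /* hypothetical fallback */
#define popcount64_inline(word) popcount64_dispatch(word)
#endif
```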