Discussion: Popcount optimization for the slow-path lookups

Popcount optimization for the slow-path lookups

From: Andrew Pogrebnoi
Date:
Hello hackers,

I want to propose an optimization for pg_popcount32_slow() and pg_popcount64_slow() where lookups into pg_number_of_ones[] are made branchless. It shows a speedup of around 58% for uint64 and 35% for uint32 words compared to the current, looped version on x86. The gain is much more significant on Arm64 (Apple M1): ~2.6x for uint64 and ~2.25x for uint32 words.
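
Roughly, the idea is to unroll the per-byte loop into a fixed set of lookups into pg_number_of_ones[], along these lines (a simplified sketch, not the exact patch):
```
/*
 * Branchless 64-bit lookup (sketch): every byte is looked up
 * unconditionally, so there is no data-dependent loop or branch.
 * pg_number_of_ones[] is the existing 256-entry table in pg_bitutils.c.
 */
static inline int
pg_popcount64_branchless(uint64 word)
{
    return pg_number_of_ones[word & 0xFF] +
           pg_number_of_ones[(word >> 8) & 0xFF] +
           pg_number_of_ones[(word >> 16) & 0xFF] +
           pg_number_of_ones[(word >> 24) & 0xFF] +
           pg_number_of_ones[(word >> 32) & 0xFF] +
           pg_number_of_ones[(word >> 40) & 0xFF] +
           pg_number_of_ones[(word >> 48) & 0xFF] +
           pg_number_of_ones[(word >> 56) & 0xFF];
}
```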

I probably have to guard the lookup in pg_popcount64_slow() with `#if SIZEOF_LONG == 8`, like it's done for __builtin_popcountl() in the same function. What do you think?

For the benchmarks, I used functions with copies of the lookup loop from pg_popcount32/64_slow() and measured them against their branchless counterparts. In C++, I pre-generated two static vectors of random uint64s and uint32s, 5M elements each, and ran the lookups over them using google/benchmark [0]. I ran this on M1 and x86 Macs.

Results:

X86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       17.2 ns         17.2 ns     39247568
Popcnt64_PG                     27.2 ns         27.0 ns     25415636
Popcnt32_BranchlessLookup       14.5 ns         14.4 ns     48441566
Popcnt32_PG                     19.5 ns         19.4 ns     36125676
```

Apple M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       6.52 ns         6.52 ns    113974186
Popcnt64_PG                     17.1 ns         17.1 ns     43015334
Popcnt32_BranchlessLookup       4.34 ns         4.34 ns    145674483
Popcnt32_PG                     9.79 ns         9.79 ns     75893374
```


I also tried the parallel summing of adjacent bits (see Hacker's Delight [1], Chapter 5). This is the code for uint64:
```
#define POPCOUNT_M0 0x5555555555555555 // 01010101 ...
#define POPCOUNT_M1 0x3333333333333333 // 00110011 ...
#define POPCOUNT_M2 0x0F0F0F0F0F0F0F0F // 00001111 ...

static inline int
pg_popcount_word64(uint64 word)
{
    word = word - ((word >> 1) & POPCOUNT_M0);
    word = ((word >> 2) & POPCOUNT_M1) + (word & POPCOUNT_M1);
    word = ((word >> 4) + word) & POPCOUNT_M2;
    word += word >> 8;
    word += word >> 16;
    word += word >> 32; // this step is omitted for the uint32 version

    return word & 0x3F;
}
```
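
The uint32 version is the same with 32-bit masks and without that final shift; for reference, a sketch:
```
#define POPCOUNT32_M0 0x55555555 // 01010101 ...
#define POPCOUNT32_M1 0x33333333 // 00110011 ...
#define POPCOUNT32_M2 0x0F0F0F0F // 00001111 ...

static inline int
pg_popcount_word32(uint32 word)
{
    word = word - ((word >> 1) & POPCOUNT32_M0);
    word = ((word >> 2) & POPCOUNT32_M1) + (word & POPCOUNT32_M1);
    word = ((word >> 4) + word) & POPCOUNT32_M2;
    word += word >> 8;
    word += word >> 16;

    return word & 0x3F;
}
```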

However, it doesn't show any improvement over the branchless lookup on M1, and only a slight gain for uint64 on x86.

The same tests with the summing of adjacent bits:
x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       17.2 ns         17.2 ns     39247568
Popcnt64_AdjacentBits           16.8 ns         16.7 ns     41481236
Popcnt64_PG                     27.2 ns         27.0 ns     25415636
Popcnt32_BranchlessLookup       14.5 ns         14.4 ns     48441566
Popcnt32_AdjacentBits           15.5 ns         15.4 ns     44454323
Popcnt32_PG                     19.5 ns         19.4 ns     36125676
```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       6.52 ns         6.52 ns    113974186
Popcnt64_AdjacentBits           8.05 ns         8.05 ns     86988490
Popcnt64_PG                     17.1 ns         17.1 ns     43015334
Popcnt32_BranchlessLookup       4.34 ns         4.34 ns    145674483
Popcnt32_AdjacentBits           7.27 ns         7.27 ns     96050714
Popcnt32_PG                     9.79 ns         9.79 ns     75893374
```



[0] https://github.com/google/benchmark
[1] https://dl.acm.org/doi/book/10.5555/2462741

-----
Cheers,
Andy

Re: Popcount optimization for the slow-path lookups

From: David Geier
Date:
Hi Andy!

On 05.12.2025 14:07, Andrew Pogrebnoi wrote:
> Hello hackers,
> 
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. [...]
> 
> % g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src
> -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
> [...]

I would like to test if I can reproduce your results. Could you share
your test program?

You also don't specify an optimization level, which means the default
level -O0 is used. Does it make a difference if you run with, e.g., -O3?

--
David Geier



Re: Popcount optimization for the slow-path lookups

From: Andrew Pogrebnoi
Date:
Hi David,

Thanks for looking at it!

> I would like to test if I can reproduce your results. Could you share
> your test program?

Here you go: https://gist.github.com/dAdAbird/1480ff15764f5a6301174806d8512a3a

> You also don't specify an optimization level. That means the default
> level -O0 is used. Does it make it a difference if you run e.g. with -O3?

Yes, good point. I got carried away with experiments and totally forgot about that.

I've re-run the tests with -O3, and it seems like the summing of adjacent bits might make sense for uint64:

X86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T16:22:43+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 2.67, 4.72, 2.90
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       5.92 ns         5.90 ns    113871130
Popcnt64_AdjacentBits           5.30 ns         5.27 ns    115084258
Popcnt64_PG                     8.34 ns         8.32 ns     80622869
Popcnt32_BranchlessLookup       3.62 ns         3.61 ns    197816110
Popcnt32_AdjacentBits           4.56 ns         4.55 ns    154932383
Popcnt32_PG                     5.32 ns         5.31 ns    121845083
```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T14:29:18+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.01, 0.01, 0.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       2.13 ns         2.13 ns    342101937
Popcnt64_AdjacentBits           1.89 ns         1.89 ns    370316571
Popcnt64_PG                     3.83 ns         3.83 ns    190990088
Popcnt32_BranchlessLookup       1.19 ns         1.19 ns    591797267
Popcnt32_AdjacentBits           1.73 ns         1.73 ns    409381266
Popcnt32_PG                     2.01 ns         2.01 ns    344969625
```

---
Cheers,
Andy

Re: Popcount optimization for the slow-path lookups

From: Nathan Bossart
Date:
On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote:
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. It shows speedup around 58% for uint64 and 35% for uint32 words
> compared to the current, looped version. This is on x86. It is much more
> significant on Arm64 (Apple M1): ~x2.6 for uint64 and ~x2.25 for uint32
> words.
> 
> I probably have to guard the lookup in pg_popcount64_slow() with `#if
> SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same
> function. What do you think?
> 
> For benchmarks, I used functions with the copies of the lookup loop from
> pg_popcount32/64_slow(), measuring them against branchless counterparts.
> Then in C++ I pre generated two static vectors with random uint64's and
> uint32's, each of 5M size and ran lookups using google/benchmark [0]. I've
> run it on M1 and x86 Macs.

I don't think the proposed improvements are relevant for either of the
machines you used for your benchmarks.  For x86, we've optimized our
popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
to use Neon or SVE.  And for other systems, we still try to use
__builtin_popcount() and friends in the fallback paths, which IIUC are
available on both gcc and clang (and maybe elsewhere).  IMHO we need to run
the benchmarks on a compiler/architecture combination where it would
actually be used in practice. 
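
To be clear about what's at stake: simplified and from memory, the slow path in pg_bitutils.c looks roughly like the sketch below, so the byte loop being optimized is only compiled when __builtin_popcount() isn't available at all.
```
/* Rough, simplified sketch of the existing fallback (from memory). */
static int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
#if SIZEOF_LONG == 8
    return __builtin_popcountl(word);
#else
    return __builtin_popcountll(word);
#endif
#else                           /* !HAVE__BUILTIN_POPCOUNT */
    int         result = 0;

    /* the loop this thread proposes to replace */
    while (word != 0)
    {
        result += pg_number_of_ones[word & 255];
        word >>= 8;
    }

    return result;
#endif
}
```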

-- 
nathan



Re: Popcount optimization for the slow-path lookups

From: Andrew Pogrebnoi
Date:
On Fri, Dec 5, 2025 at 5:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I don't think the proposed improvements are relevant for either of the
> machines you used for your benchmarks.  For x86, we've optimized our
> popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
> to use Neon or SVE.  And for other systems, we still try to use
> __builtin_popcount() and friends in the fallback paths, which IIUC are
> available on both gcc and clang (and maybe elsewhere).  IMHO we need to run
> the benchmarks on a compiler/architecture combination where it would
> actually be used in practice.

Yes, I saw that the code is on a rather obscure path, but those machines were my only options for quick benchmarks.
I reasoned that the code path still exists, and eliminating the branching there would (most probably) be beneficial anyway.
But you are right, we need to test it on the target architectures/compilers. I'll try to do that.

---
Cheers,
Andy

Re: Popcount optimization for the slow-path lookups

From: John Naylor
Date:
On Fri, Dec 5, 2025 at 10:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I don't think the proposed improvements are relevant for either of the
> machines you used for your benchmarks.  For x86, we've optimized our
> popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
> to use Neon or SVE.  And for other systems, we still try to use
> __builtin_popcount() and friends in the fallback paths, which IIUC are
> available on both gcc and clang (and maybe elsewhere).  IMHO we need to run
> the benchmarks on a compiler/architecture combination where it would
> actually be used in practice.

Yeah, if we did anything here, I'd rather arrange so that
architectures that have unconditional hardware support can inline it
at compile time. I believe ppc64le and aarch64 can do that
unconditionally. For x86 we might be able to detect some symbol
defined by the compiler, to do the same thing for OS's that require
such support.
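
Hypothetically, something along these lines (just a sketch; __POPCNT__ is one such compiler-defined symbol on x86):
```
/*
 * Hypothetical sketch, not a patch: on platforms where hardware popcount
 * is unconditionally available, let the compiler inline the builtin
 * instead of going through the runtime-dispatched function.  gcc/clang
 * define __POPCNT__ when the x86 target is known to have POPCNT
 * (e.g. -mpopcnt or -march=x86-64-v2).
 */
#if defined(__aarch64__) || defined(__powerpc64__) || defined(__POPCNT__)
#define pg_popcount64_inline(x) __builtin_popcountll(x)
#else
#define pg_popcount64_inline(x) pg_popcount64(x)    /* existing dispatched/slow path */
#endif
```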

--
John Naylor
Amazon Web Services