Popcount optimization using AVX512

Поиск
Список
Период
Сортировка
От Amonson, Paul D
Тема Popcount optimization using AVX512
Дата
Msg-id BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A@BL1PR11MB5304.namprd11.prod.outlook.com
обсуждение исходный текст
Ответы Re: Popcount optimization using AVX512  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Список pgsql-hackers
This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share
thepreliminary results with the community and get feedback for adding avx512 support for popcount.  
 
Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the
pg_popcount()in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this
implementationhas improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit
scenariosrelying on popcount. 
 
My setup:
 
Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
OS : Ubuntu 22.04
GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".

1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
    a. Software only and
    b. SSE 64 bit version
2. I created an implementation using the following AVX512 intrinsics:
    a. _mm512_popcnt_epi64()
    b. _mm512_reduce_add_epi64()
3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])
4. I tested 5 seeds for each input buffer size and averaged 100 runs each (5*5*100=2500 pg_popcount() calls on a single
thread)
5. Data: <See Attached picture.>

The code I wrote uses the 64-bit solution or SW on the memory not aligned to a 512-bit boundary in memory:
 
///////////////////////////////////////////////////////////////////////
// 512-bit intrisic implementation (AVX512VPOPCNTDQ + AVX512F)
uint64_t popcount_512_impl(const char *bytes, int byteCount) {
#ifdef __AVX__
    uint64_t result = 0;
    uint64_t remainder = ((uint64_t)bytes) % 64;
    result += popcount_64_impl(bytes, remainder);
    byteCount -= remainder;
    bytes += remainder;
    uint64_t vectorCount = byteCount / 64;
    remainder = byteCount % 64;
    __m512i *vectors = (__m512i *)bytes;
    __m512i rv;
    while (vectorCount--) {
        rv = _mm512_popcnt_epi64(*(vectors++));
        result += _mm512_reduce_add_epi64(rv);
    }
    bytes = (const char *)vectors;
    result += popcount_64_impl(bytes, remainder);
    return result;
#else
    return popcount_64_impl(bytes, byteCount);
#endif
}
 
There are further optimizations that can be applied here, but for demonstration I added the __AVX__ macro and if not
fallback to the original implementations in PostgreSQL. 
 
The 46% improvement in popcount is worthy of discussion considering the previous popcount 64-bit SSE and SW
implementations. 
 
 Thanks,
Paul Amonson


Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: John Naylor
Дата:
Сообщение: Re: Extract numeric filed in JSONB more effectively
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Tab completion regression test failed on illumos