Re: Popcount optimization using AVX512

Поиск
Список
Период
Сортировка
От Nathan Bossart
Тема Re: Popcount optimization using AVX512
Дата
Msg-id 20240307213600.GA563369@nathanxps13
обсуждение исходный текст
Ответ на Re: Popcount optimization using AVX512  (Nathan Bossart <nathandbossart@gmail.com>)
Ответы RE: Popcount optimization using AVX512  ("Amonson, Paul D" <paul.d.amonson@intel.com>)
Список pgsql-hackers
As promised...

> +# Check for Intel AVX512 intrinsics to do POPCNT calculations.
> +#
> +PGAC_AVX512_POPCNT_INTRINSICS([])
> +if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
> +  PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
> +fi
> +AC_SUBST(CFLAGS_AVX512_POPCNT)

I'm curious why we need both -mavx512vpopcntdq and -mavx512f.  On my
machine, -mavx512vpopcntdq alone is enough to pass this test, so if there
are other instructions required that need -mavx512f, then we might need to
expand the test.

> 13 files changed, 657 insertions(+), 119 deletions(-)

I still think it's worth breaking this change into at least 2 patches.  In
particular, I think there's an opportunity to do the refactoring into
pg_popcnt_choose.c and pg_popcnt_x86_64_accel.c prior to adding the AVX512
stuff.  These changes are likely straightforward, and getting them out of
the way early would make it easier to focus on the more interesting
changes.  IMHO there are a lot of moving parts in this patch.

> +#undef HAVE__GET_CPUID_COUNT
> +
> +/* Define to 1 if you have  immintrin. */
> +#undef HAVE__IMMINTRIN

Is this missing HAVE__CPUIDEX?

>  uint64
> -pg_popcount(const char *buf, int bytes)
> +pg_popcount_slow(const char *buf, int bytes)
>  {
>      uint64      popcnt = 0;
> 
> -#if SIZEOF_VOID_P >= 8
> +#if SIZEOF_VOID_P == 8
>      /* Process in 64-bit chunks if the buffer is aligned. */
>      if (buf == (const char *) TYPEALIGN(8, buf))
>      {
> @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
> 
>          buf = (const char *) words;
>      }
> -#else
> +#elif SIZEOF_VOID_P == 4
>      /* Process in 32-bit chunks if the buffer is aligned. */
>      if (buf == (const char *) TYPEALIGN(4, buf))
>      {

Apologies for harping on this, but I'm still not seeing the need for these
SIZEOF_VOID_P changes.  While it's unlikely that this makes any practical
difference, I see no reason to more strictly check SIZEOF_VOID_P here.

> +    /* Process any remaining bytes */
> +    while (bytes--)
> +        popcnt += pg_number_of_ones[(unsigned char) *buf++];
> +    return popcnt;
> +#else
> +    return pg_popcount_slow(buf, bytes);
> +#endif                          /* USE_AVX512_CODE */

nitpick: Could we call pg_popcount_slow() in a common section for these
"remaining bytes?"

> +#if defined(_MSC_VER)
> +        pg_popcount_indirect = pg_popcount512_fast;
> +#else
> +        pg_popcount = pg_popcount512_fast;
> +#endif

These _MSC_VER sections are interesting.  I'm assuming this is the
workaround for the MSVC linking issue you mentioned above.  I haven't
looked too closely, but I wonder if the CRC32C code (see
src/include/port/pg_crc32c.h) is doing something different to avoid this
issue.

Upthread, Alvaro suggested a benchmark [0] that might be useful.  I scanned
through this thread and didn't see any recent benchmark results for the
latest form of the patch.  I think it's worth verifying that we are still
seeing the expected improvements.

[0] https://postgr.es/m/202402071953.5c4z7t6kl7ts%40alvherre.pgsql

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Следующее
От: Nathan Bossart
Дата:
Сообщение: Re: postgres_fdw test timeouts