Thread: Re: [PATCH] Hex-coding optimizations using SVE on ARM.


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
On Wed, Jan 22, 2025 at 11:10:10AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> I realized I didn't attach the patch.

Thanks.  Would you mind creating a commitfest entry for this one?

-- 
nathan



Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
I have marked the commitfest entry for this [0] as waiting-on-author
because the patch needs to be rebased.

[0] https://commitfest.postgresql.org/patch/5538/

-- 
nathan



Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: "Chiranmoy.Bhattacharya@fujitsu.com"
Date:
Here's the rebased patch with a few modifications.

The hand-unrolled hex encode performs better than the non-unrolled version
on r8g.4xlarge, but shows no improvement on m7g.4xlarge.
I've also added line-by-line comments explaining the changes, with an example.
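For context, the scalar baseline being vectorized here expands each input byte into two hex digits via a lookup table. A rough sketch of that baseline (not the exact PostgreSQL code, which lives in src/backend/utils/adt/encode.c):

```c
#include <stddef.h>
#include <stdint.h>

/* Rough sketch of the scalar hex_encode baseline: each input byte expands
 * to two hex digits via a 16-entry lookup table. */
static const char hextbl[] = "0123456789abcdef";

static uint64_t
hex_encode_scalar(const char *src, size_t len, char *dst)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char) src[i];

        *dst++ = hextbl[c >> 4];    /* high nibble */
        *dst++ = hextbl[c & 0xF];   /* low nibble */
    }
    return (uint64_t) len * 2;      /* output is always 2x the input size */
}
```

The SIMD versions below do the same nibble-splitting, but 16 or more bytes at a time.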

Below are the results. Input size is in bytes, and exec time is in ms.

encode - r8g.4xlarge

 Input | master |   SVE  | SVE-unrolled
-------+--------+--------+--------------
     8 |  4.971 |  6.434 |        6.623
    16 |  8.532 |  4.399 |        4.710
    24 | 12.296 |  5.007 |        5.780
    32 | 16.003 |  5.027 |        5.234
    40 | 19.628 |  5.807 |        6.201
    48 | 23.277 |  5.815 |        6.222
    56 | 26.927 |  6.744 |        7.030
    64 | 30.419 |  6.774 |        6.347
   128 | 83.250 | 10.214 |        9.104
   256 |112.158 | 17.892 |       16.313
   512 |216.544 | 31.060 |       29.876
  1024 |429.351 | 59.310 |       53.374
  2048 |854.677 |116.769 |      101.004
  4096 |1706.528|237.322 |      195.297
  8192 |3723.884|499.520 |      385.424
---------------------------------------

encode - m7g.4xlarge

 Input | master |   SVE  | SVE-unrolled
-------+--------+--------+--------------
     8 |  5.503 |  7.986 |        8.053
    16 |  9.881 |  9.583 |        9.888
    24 | 13.854 |  9.212 |       10.138
    32 | 18.056 |  9.208 |        9.364
    40 | 22.127 | 10.134 |       10.540
    48 | 26.214 | 10.186 |       10.550
    56 | 29.718 | 10.197 |       10.428
    64 | 33.613 | 10.982 |       10.497
   128 | 66.060 | 12.460 |       12.624
   256 |130.225 | 18.491 |       18.872
   512 |267.105 | 30.343 |       31.661
  1024 |515.603 | 54.371 |       55.341
  2048 |1013.766|103.898 |      105.192
  4096 |2018.705|202.653 |      203.142
  8192 |4000.496|400.918 |      401.842
---------------------------------------

decode - r8g.4xlarge

 Input | master |   SVE  
-------+--------+--------
     8 |  7.641 |  8.787
    16 | 14.301 | 14.477
    32 | 28.663 |  6.091
    48 | 42.940 | 17.604
    64 | 57.483 | 10.549
    80 | 71.637 | 19.194
    96 | 85.918 | 15.586
   112 |100.272 | 25.956
   128 |114.740 | 19.829
   256 |229.176 | 36.032
   512 |458.295 | 68.222
  1024 |916.741 |132.927
  2048 |1833.422|262.741
  4096 |3667.096|522.009
  8192 |7333.886|1042.447
---------------------------------------

decode - m7g.4xlarge

 Input | master |   SVE  
-------+--------+--------
     8 |  8.194 |  9.433
    16 | 14.397 | 15.606
    32 | 26.669 | 29.006
    48 | 45.971 | 48.984
    64 | 58.468 | 12.388
    80 | 70.820 | 22.295
    96 | 84.792 | 43.470
   112 | 98.992 | 54.282
   128 |113.250 | 25.508
   256 |218.743 | 45.165
   512 |414.133 | 86.800
  1024 |828.493 |174.670
  2048 |1617.921|346.375
  4096 |3259.159|689.391
  8192 |6551.879|1376.195

--------
Chiranmoy
Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: "Chiranmoy.Bhattacharya@fujitsu.com"
Date:
Hi all,

Since the CommitFest is underway, could we get some feedback to improve the patch?

_______
Chiranmoy

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
On Wed, Sep 03, 2025 at 11:11:24AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> Since the CommitFest is underway, could we get some feedback to improve
> the patch?

I see that there was some discussion about a Neon implementation upthread,
but I'm not sure we concluded anything.  For popcount, we first added a
Neon version before adding the SVE version, which required more complicated
configure/runtime checks.  Presumably Neon is available on more hardware
than SVE, so that could be a good place to start here, too.

Also, I'd strongly encourage you to get involved with others' patches on
the mailing lists (e.g., reviewing, testing).  Patch submissions are great,
but this community depends on other types of participation, too.  IME
helping others with their patches also tends to incentivize others to help
with yours.

-- 
nathan



Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: John Naylor
Date:
On Wed, Sep 3, 2025 at 6:11 PM Chiranmoy.Bhattacharya@fujitsu.com
<Chiranmoy.Bhattacharya@fujitsu.com> wrote:
>
> Hi all,
>
> Since the CommitFest is underway, could we get some feedback to improve the patch?

On that note, I was hoping you could give us feedback on whether the
improvement in PG18 made any difference at all in your real-world
use-case, i.e. not just in a microbenchmark, but also including
transmission of the hex-encoded values across the network to the
client (that I assume must decode them again).

--
John Naylor
Amazon Web Services



Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: "Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> I see that there was some discussion about a Neon implementation upthread,
> but I'm not sure we concluded anything.  For popcount, we first added a
> Neon version before adding the SVE version, which required more complicated
> configure/runtime checks.  Presumably Neon is available on more hardware
> than SVE, so that could be a good place to start here, too.

We have added the Neon versions of hex encode/decode.
Here are the microbenchmark numbers.

hex_encode - m7g.4xlarge
 Input |  Head  |  Neon
-------+--------+--------
    32 | 18.056 |  5.957
    40 | 22.127 | 10.205
    48 | 26.214 | 14.151
    64 | 33.613 |  6.164
   128 | 66.060 | 11.372
   256 |130.225 | 18.543
   512 |267.105 | 33.977
  1024 |515.603 | 64.462

hex_decode - m7g.4xlarge
 Input |  Head  |  Neon
-------+--------+--------
    32 | 26.669 |  9.462
    40 | 36.320 | 19.347
    48 | 45.971 | 19.099
    64 | 58.468 | 17.648
   128 |113.250 | 30.437
   256 |218.743 | 56.824
   512 |414.133 |107.212
  1024 |828.493 |210.740


> Also, I'd strongly encourage you to get involved with others' patches on
> the mailing lists (e.g., reviewing, testing).  Patch submissions are great,
> but this community depends on other types of participation, too.  IME
> helping others with their patches also tends to incentivize others to help
> with yours.

Sure, we will try to test and review patches in areas where we have experience.


> On that note, I was hoping you could give us feedback on whether the
> improvement in PG18 made any difference at all in your real-world
> use-case, i.e. not just in a microbenchmark, but also including
> transmission of the hex-encoded values across the network to the
> client (that I assume must decode them again).

Yes, the improvement in v18 did help; see the attached perf graphs.
We used a Python script to send and receive binary data from Postgres.
For simple SELECT queries on a bytea column, hex_encode accounted for
42% of query execution time in v17; this dropped to 33% in v18,
yielding roughly an 18% improvement in overall query time.

The proposed patch further reduces hex_encode's share of execution time
to 5.6%, another ~25% improvement in total query time.

We observed similar improvements for INSERT queries on the bytea column:
hex_decode's share decreased from 15.5% to 5.5%, a 5-8% query-level
improvement depending on which storage type is used.

------
Chiranmoy
Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
On Thu, Sep 04, 2025 at 02:55:50PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> I see that there was some discussion about a Neon implementation upthread,
>> but I'm not sure we concluded anything.  For popcount, we first added a
>> Neon version before adding the SVE version, which required more complicated
>> configure/runtime checks.  Presumably Neon is available on more hardware
>> than SVE, so that could be a good place to start here, too.
> 
> We have added the Neon versions of hex encode/decode.

Thanks.  I noticed that this stuff is simple enough that we can use
port/simd.h (with a few added functions).  This is especially nice because
it takes care of x86, too.  The performance gains look similar to what you
reported for v6:

                arm
    buf  | HEAD  | patch | % diff
  -------+-------+-------+--------
      16 |    13 |     6 |   54
      64 |    34 |     9 |   74
     256 |    93 |    25 |   73
    1024 |   281 |    78 |   72
    4096 |  1086 |   227 |   79
   16384 |  4382 |   927 |   79
   65536 | 17455 |  3608 |   79

                x86
    buf  | HEAD  | patch | % diff
  -------+-------+-------+--------
      16 |    10 |     7 |   30
      64 |    29 |     9 |   69
     256 |    81 |    21 |   74
    1024 |   286 |    66 |   77
    4096 |  1106 |   253 |   77
   16384 |  4383 |   980 |   78
   65536 | 17491 |  3886 |   78

I've only modified hex_encode() for now, but I'm optimistic that we can do
something similar for hex_decode().

-- 
nathan

Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: "Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> Thanks.  I noticed that this stuff is simple enough that we can use
> port/simd.h (with a few added functions).  This is especially nice because
> it takes care of x86, too.  The performance gains look similar to what you
> reported for v6:

This looks good, much cleaner.
One possible improvement would be to use a vectorized table lookup
instead of compare-and-add.  I compared the v6 and v7 Neon versions, and
v6 is always faster.  I'm not sure whether SSE2 has a table lookup
similar to Neon's.

arm - m7g.4xlarge
  buf  | v6-Neon| v7-Neon| % diff
-------+--------+--------+--------
    64 |   6.16 |   8.57 |  28.07
   128 |  11.37 |  15.77 |  27.87
   256 |  18.54 |  30.28 |  38.77
   512 |  33.98 |  62.15 |  45.33
  1024 |  64.46 | 117.55 |  45.16
  2048 | 124.28 | 254.86 |  51.24
  4096 | 243.47 | 509.23 |  52.19
  8192 | 487.34 | 953.81 |  48.91
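The two nibble-to-ASCII strategies being compared here can be illustrated in scalar C (a sketch, not the patch's vector code): Neon's vqtbl1q_u8 performs 16 of the table lookups at once, while the simd.h version compares each nibble against 10 and adds one of two offsets, since SSE2 lacks a byte-shuffle table lookup.

```c
#include <stdint.h>

/* Table-lookup strategy (what Neon TBL vectorizes). */
static uint8_t
nibble_lookup(uint8_t n)
{
    static const char tbl[] = "0123456789abcdef";

    return (uint8_t) tbl[n & 0xF];
}

/* Compare-and-add strategy (SSE2-friendly): add '0', plus the
 * 'a' - '0' - 10 gap when the nibble exceeds 9, with no branches. */
static uint8_t
nibble_cmp_add(uint8_t n)
{
    n &= 0xF;
    return (uint8_t) (n + '0' + (n > 9) * ('a' - '0' - 10));
}
```

Both produce identical digits; the difference on real hardware is purely in instruction count and latency, which is what the table below measures.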

-----
Chiranmoy

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
On Thu, Sep 11, 2025 at 10:43:56AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> One possible improvement would be to use a vectorized table lookup
> instead of compare and add. I compared v6 and v7 Neon versions, and v6 is
> always faster.  I’m not sure if SSE2 has a table lookup similar to Neon.

I'm not finding a simple way to do that kind of table lookup in SSE2.  Part
of the reason v6 performs better is that you've unrolled the loop to
process two vectors' worth of input data in each iteration.  This trades
performance with smaller inputs for gains with larger ones.  But even if I
do something similar for v7, v6 still wins most of the time.

My current philosophy with this stuff is to favor simplicity,
maintainability, portability, etc. over extracting the absolute maximum
amount of performance gain, so I think we should proceed with the simd.h
approach.  But I'm curious how others feel about this.

v8 is an attempt to fix the casting error on MSVC.

-- 
nathan

Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> My current philosophy with this stuff is to favor simplicity,
> maintainability, portability, etc. over extracting the absolute maximum
> amount of performance gain, so I think we should proceed with the simd.h
> approach.  But I'm curious how others feel about this.

+1.  The maintainability aspect is critical over the long run.
Also, there's a very real danger of optimizing for the specific
hardware and test case you are working with, leading to actually
worse performance with future hardware.

            regards, tom lane



Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: "Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> My current philosophy with this stuff is to favor simplicity,
> maintainability, portability, etc. over extracting the absolute maximum
> amount of performance gain, so I think we should proceed with the simd.h
> approach. But I'm curious how others feel about this.

>  +1. The maintainability aspect is critical over the long run.
> Also, there's a very real danger of optimizing for the specific
> hardware and test case you are working with, leading to actually
> worse performance with future hardware.

Using simd.h does make it easier to maintain.
Is there a plan to upgrade simd.h to use SSE4 or SSSE3 in the future?
Since SSE2 is much older, it lacks some of the more specialized intrinsics.
For example, vectorized table lookup can be implemented via [0], and
it’s available in SSSE3 and later x86 instruction sets.


-----
Chiranmoy

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
On Fri, Sep 12, 2025 at 06:49:01PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> Using simd.h does make it easier to maintain.  Is there a plan to upgrade
> simd.h to use SSE4 or SSSE3 in the future?  Since SSE2 is much older, it
> lacks some of the more specialized intrinsics.  For example, vectorized
> table lookup can be implemented via [0], and it’s available in SSSE3 and
> later x86 instruction sets.

There have been a couple of discussions about the possibility of requiring
x86-64-v2 for Postgres, but I'm not aware of any serious efforts in that
area.

I've attached a new version of the patch with a simd.h version of
hex_decode().  Here are the numbers:

                arm
    buf  | HEAD  | patch | % diff
  -------+-------+-------+--------
      16 |    22 |    23 |   -5
      64 |    61 |    23 |   62
     256 |   158 |    47 |   70
    1024 |   542 |   122 |   77
    4096 |  2103 |   429 |   80
   16384 |  8548 |  1673 |   80
   65536 | 34663 |  6738 |   81

                x86
    buf  | HEAD  | patch | % diff
  -------+-------+-------+--------
      16 |    13 |    14 |   -8
      64 |    42 |    15 |   64
     256 |   126 |    42 |   67
    1024 |   461 |   149 |   68
    4096 |  1802 |   576 |   68
   16384 |  7166 |  2280 |   68
   65536 | 28625 |  9108 |   68

A couple of notes:

* For hex_decode(), we just give up on the SIMD path and fall back on the
  scalar path as soon as we see anything outside [0-9A-Fa-f].  I suspect
  this might introduce a regression for inputs of ~32 to ~64 bytes that
  include whitespace (which must be skipped) or invalid characters, but I
  don't know whether those inputs are common or whether we care.

* The code makes some assumptions about endianness that might not be true
  everywhere, but I've yet to dig into this further.
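A hypothetical scalar fallback matching the behavior described in the first point (skip whitespace, fail on anything that isn't a hex digit, fail on an odd digit count) might look like the following sketch; it is not PostgreSQL's actual hex_decode_safe():

```c
#include <stddef.h>

/* Map one hex digit to its value; -1 for anything outside [0-9A-Fa-f]. */
static int
hex_digit_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Decode len bytes of src into dst, skipping whitespace; returns the
 * number of output bytes, or -1 on invalid input. */
static long
hex_decode_sketch(const char *src, size_t len, unsigned char *dst)
{
    long n = 0;
    int hi = -1;

    for (size_t i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char) src[i];
        int v;

        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            continue;               /* whitespace must be skipped */
        v = hex_digit_value(c);
        if (v < 0)
            return -1;              /* invalid character */
        if (hi < 0)
            hi = v;                 /* first digit of the pair */
        else
        {
            dst[n++] = (unsigned char) ((hi << 4) | v);
            hi = -1;
        }
    }
    return (hi < 0) ? n : -1;       /* odd digit count is an error */
}
```

The concern above is that inputs containing whitespace bail out of the SIMD path and pay for both the vector attempt and this kind of scalar pass.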

-- 
nathan

Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
On Fri, Sep 12, 2025 at 04:30:21PM -0500, Nathan Bossart wrote:
> I've attached a new version of the patch with a simd.h version of
> hex_decode().  Here are the numbers:

I was able to improve the hex_decode() implementation a bit.

                arm
    buf  | HEAD  | patch | % diff
  -------+-------+-------+--------
      16 |    11 |    11 |    0
      64 |    38 |     7 |   82
     256 |   133 |    18 |   86
    1024 |   513 |    67 |   87
    4096 |  2037 |   271 |   87
   16384 |  8326 |  1103 |   87
   65536 | 34550 |  4475 |   87

                x86
    buf  | HEAD  | patch | % diff
  -------+-------+-------+--------
      16 |     8 |     9 |  -13
      64 |    38 |     7 |   82
     256 |   121 |    24 |   80
    1024 |   457 |    91 |   80
    4096 |  1797 |   356 |   80
   16384 |  7161 |  1411 |   80
   65536 | 28620 |  5632 |   80

-- 
nathan

Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
On Mon, Sep 22, 2025 at 03:05:44PM -0500, Nathan Bossart wrote:
> I was able to improve the hex_decode() implementation a bit.

I took a closer look at how hex_decode() performs with smaller inputs.
There are some small regressions, so I tried fixing them by adding the
following to the beginning of the function:

    if (likely(tail_idx == 0))
        return hex_decode_safe_scalar(src, len, dst, escontext);

This helped a little, but it mostly just slowed things down for larger
inputs on AArch64:

                arm
    buf  | HEAD  | patch |  fix 
  -------+-------+-------+-------
       2 |     4 |     6 |     4
       4 |     6 |     7 |     7
       8 |     8 |     8 |     8
      16 |    11 |    12 |    11
      32 |    18 |     5 |     6
      64 |    38 |     7 |     8
     256 |   134 |    18 |    24
    1024 |   514 |    67 |   100
    4096 |  2072 |   280 |   389
   16384 |  8409 |  1126 |  1537
   65536 | 34704 |  4498 |  6128

                x86
    buf  | HEAD  | patch |  fix
  -------+-------+-------+-------
       2 |     2 |     2 |     2
       4 |     3 |     3 |     3
       8 |     4 |     4 |     4
      16 |     8 |     9 |     8
      32 |    23 |     5 |     5
      64 |    37 |     7 |     7
     256 |   122 |    24 |    24
    1024 |   457 |    91 |    92
    4096 |  1798 |   357 |   358
   16384 |  7161 |  1411 |  1416
   65536 | 28621 |  5630 |  5653

I didn't do this test for hex_encode(), but I'd expect it to follow a
similar pattern.  I'm tempted to suggest that these regressions are within
tolerable levels and to forge on with v10.  In any case, IMHO this patch is
approaching committable quality, so I'd be grateful for any feedback.

-- 
nathan



Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: John Naylor
Date:
On Wed, Sep 24, 2025 at 2:02 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Sep 22, 2025 at 03:05:44PM -0500, Nathan Bossart wrote:
> > I was able to improve the hex_decode() implementation a bit.
>
> I took a closer look at how hex_decode() performs with smaller inputs.
> There are some small regressions, so I tried fixing them by adding the
> following to the beginning of the function:
>
>     if (likely(tail_idx == 0))
>         return hex_decode_safe_scalar(src, len, dst, escontext);
>
> This helped a little, but it mostly just slowed things down for larger
> inputs on AArch64:

> I didn't do this test for hex_encode(), but I'd expect it to follow a
> similar pattern.  I'm tempted to suggest that these regressions are within
> tolerable levels and to forge on with v10.

My first thought is, I'd hazard a guess that short byteas are much
less common than short strings.

My second thought is, the decode case is not that critical. From the
end-to-end tests above, the speed of the decode case had a relatively
small global effect compared to the encode case (perhaps because reads
are cheaper than writes).

+ if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
+ break;

But if you really want to do something here, sprinkling "(un)likely"'s
here seems like solving the wrong problem (even if they make any
difference), since the early return is optimizing for exceptional
conditions. In other places (cf. the UTF8 string verifier), we
accumulate errors, and only if we have them at the end do we restart
from the beginning with the slow error-checking path that can show the
user the offending input.
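The accumulate-then-recheck pattern described here can be sketched in scalar form (hypothetical helper names; in the real fast path each iteration checks a whole vector and ORs error bits into an accumulator vector):

```c
#include <stdbool.h>
#include <stddef.h>

/* Fast validity check for one chunk: accumulate a single ok flag rather
 * than branching per byte, mirroring how a SIMD path ORs per-lane error
 * bits into an accumulator. */
static bool
chunk_is_hex(const char *p, size_t n)
{
    bool ok = true;

    for (size_t i = 0; i < n; i++)
    {
        char c = p[i];

        ok &= (c >= '0' && c <= '9') ||
              (c >= 'a' && c <= 'f') ||
              (c >= 'A' && c <= 'F');
    }
    return ok;
}

/* Validate the whole input chunk by chunk; only after the fast loop
 * finishes would a caller restart with the slow, error-reporting scalar
 * path (represented here by simply returning false). */
static bool
validate_fast(const char *src, size_t len)
{
    bool ok = true;

    for (size_t i = 0; i < len; i += 16)
        ok &= chunk_is_hex(src + i, (len - i < 16) ? len - i : 16);
    return ok;
}
```

The point is that the hot loop stays branch-free; the expensive "which byte was bad?" work only runs when the accumulated flag shows a failure.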

> In any case, IMHO this patch is
> approaching committable quality, so I'd be grateful for any feedback.

+vector8_sssub(const Vector8 v1, const Vector8 v2)

It's hard to parse "sss", so maybe we can borrow an Intel-ism and use
"iss" for the signed case?

+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const
Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const
Vector8 v2);
+static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_byte(const Vector32 v1);

Do we need declarations for these? I recall that the existing
declarations are there for functions that are also used internally.

The nibble/byte things are rather specific. Wouldn't it be more
logical to expose the already-generic shift operations and let the
caller say by how much? Or does the compiler refuse because the
intrinsic doesn't get an immediate value? Some are like that, but I'm
not sure about these. If so, that's annoying and I wonder if there's a
workaround.

+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+ Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
+ Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
+
+ return vector8_is_highbit_set(cmpe);

We take pains to avoid signed comparison on unsigned input for the
"le" case, and I don't see why it's okay here.

Do the regression tests have long enough cases that test exceptional
paths, like invalid bytes and embedded whitespace? If not, we need
some.

--
John Naylor
Amazon Web Services



Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: Nathan Bossart
Date:
On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
> + if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
> + break;
> 
> But if you really want to do something here, sprinkling "(un)likely"'s
> here seems like solving the wrong problem (even if they make any
> difference), since the early return is optimizing for exceptional
> conditions. In other places (cf. the UTF8 string verifier), we
> accumulate errors, and only if we have them at the end do we restart
> from the beginning with the slow error-checking path that can show the
> user the offending input.

I switched to an accumulator approach in v11.

> +vector8_sssub(const Vector8 v1, const Vector8 v2)
> 
> It's hard to parse "sss", so maybe we can borrow an Intel-ism and use
> "iss" for the signed case?

Done.

> +/* vector manipulation */
> +#ifndef USE_NO_SIMD
> +static inline Vector8 vector8_interleave_low(const Vector8 v1, const
> Vector8 v2);
> +static inline Vector8 vector8_interleave_high(const Vector8 v1, const
> Vector8 v2);
> +static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
> +static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
> +static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
> +static inline Vector32 vector32_shift_right_byte(const Vector32 v1);
> 
> Do we need declarations for these? I recall that the existing
> declarations are there for functions that are also used internally.

Removed.

> The nibble/byte things are rather specific. Wouldn't it be more
> logical to expose the already-generic shift operations and let the
> caller say by how much? Or does the compiler refuse because the
> intrinsic doesn't get an immediate value? Some are like that, but I'm
> not sure about these. If so, that's annoying and I wonder if there's a
> workaround.

Yeah, the compiler refuses unless the value is an integer literal.  I
thought of using a switch statement to cover all the values used in-tree,
but I didn't like that, either.

> +vector8_has_ge(const Vector8 v, const uint8 c)
> +{
> +#ifdef USE_SSE2
> + Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
> + Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
> +
> + return vector8_is_highbit_set(cmpe);
> 
> We take pains to avoid signed comparison on unsigned input for the
> "le" case, and I don't see why it's okay here.

_mm_max_epu8() does unsigned comparisons, I think...
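The identity the SSE2 code relies on can be checked in scalar form: for unsigned bytes, max(v, c) == v holds exactly when v >= c, so an unsigned max plus an equality compare emulates the unsigned >= that SSE2 otherwise lacks. A one-lane sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Scalar model of the vector8_has_ge trick: unsigned max followed by an
 * equality compare implements an unsigned >= comparison. */
static bool
byte_ge_via_max(uint8_t v, uint8_t c)
{
    uint8_t umax = (v > c) ? v : c;     /* _mm_max_epu8 on one lane */

    return umax == v;                   /* _mm_cmpeq_epi8 on one lane */
}
```

Since _mm_max_epu8 is defined on unsigned bytes, no sign trickery is needed, unlike the LE case.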

> Do the regression tests have long enough cases that test exceptional
> paths, like invalid bytes and embedded whitespace? If not, we need
> some.

Added.

I've also fixed builds on gcc/arm64, as discussed elsewhere [0].  Here are
the current numbers on my laptop:

                arm
    buf  | HEAD  | patch | % diff
  -------+-------+-------+--------
       2 |     4 |     4 |      0
       4 |     6 |     6 |      0
       8 |     8 |     8 |      0
      16 |    11 |    12 |     -9
      32 |    18 |     5 |     72
      64 |    38 |     6 |     84
     256 |   134 |    17 |     87
    1024 |   513 |    63 |     88
    4096 |  2081 |   262 |     87
   16384 |  8524 |  1058 |     88
   65536 | 34731 |  4224 |     88

[0] https://postgr.es/m/aNQtN89VW8z-yo3B%40nathan

-- 
nathan

Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From: John Naylor
Date:
On Thu, Sep 25, 2025 at 4:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
> > + if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
> > + break;
> >
> > But if you really want to do something here, sprinkling "(un)likely"'s
> > here seems like solving the wrong problem (even if they make any
> > difference), since the early return is optimizing for exceptional
> > conditions. In other places (cf. the UTF8 string verifier), we
> > accumulate errors, and only if we have them at the end do we restart
> > from the beginning with the slow error-checking path that can show the
> > user the offending input.
>
> I switched to an accumulator approach in v11.

Looks good to me.

+ if (unlikely(!success))
+ i = 0;

This is after the main loop exits, and the cold path is literally one
instruction, so the motivation is not apparent to me.

> > The nibble/byte things are rather specific. Wouldn't it be more
> > logical to expose the already-generic shift operations and let the
> > caller say by how much? Or does the compiler refuse because the
> > intrinsic doesn't get an immediate value? Some are like that, but I'm
> > not sure about these. If so, that's annoying and I wonder if there's a
> > workaround.
>
> Yeah, the compiler refuses unless the value is an integer literal.  I
> thought of using a switch statement to cover all the values used in-tree,
> but I didn't like that, either.

Neither option is great, but I mildly lean towards keeping it internal
with "switch" or whatever: By putting the burden of specifying shift
amounts on separately named functions we run a risk of combinatorial
explosion in function names.

If you feel otherwise, I'd at least use actual numbers:
"shift_left_nibble" is an awkward way to say "shift left by 4 bits"
anyway, and also after "byte" and "nibble" there are not many good
English words to convey the operand amount. It's very possible that
needing other shift amounts will never come up, though.

> > +vector8_has_ge(const Vector8 v, const uint8 c)
> > +{
> > +#ifdef USE_SSE2
> > + Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
> > + Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
> > +
> > + return vector8_is_highbit_set(cmpe);
> >
> > We take pains to avoid signed comparison on unsigned input for the
> > "le" case, and I don't see why it's okay here.
>
> _mm_max_epu8() does unsigned comparisons, I think...

Ah, I confused myself about what the LE case was avoiding, namely
signed LE, not signed equality on something else.

(Separately, now I'm wondering if we can do the same for
vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
would allow getting rid of )

> > Do the regression tests have long enough cases that test exceptional
> > paths, like invalid bytes and embedded whitespace? If not, we need
> > some.
>
> Added.

Seems comprehensive enough at a glance.

Other comments:

+ * back together to form the final hex-encoded string.  It might be
+ * possible to squeeze out a little more gain by manually unrolling the
+ * loop, but for now we don't bother.

My position (and I think the community agrees) is that manual
unrolling is a rare desperation move that has to be justified, so we
don't need to mention its lack.

+ * Some compilers are picky about casts to the same underlying type, and others
+ * are picky about implicit conversions with vector types.  This function does
+ * the same thing as vector32_broadcast(), but it returns a Vector8 and is
+ * carefully crafted to avoid compiler indigestion.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_broadcast_u32(const uint32 c)
+{
+#ifdef USE_SSE2
+ return vector32_broadcast(c);
+#elif defined(USE_NEON)
+ return (Vector8) vector32_broadcast(c);
+#endif
+}

I'm ambivalent about this: The use case doesn't seem well motivated,
since I don't know why we'd actually need to both broadcast arbitrary
integers and also view the result as bytes. Setting arbitrary bytes is
what we're really doing, and would be more likely to be useful in the
future (attached, only tested on x86, and I think part of the
strangeness is the endianness you mentioned above). On the other hand,
the Arm workaround results in awful generated code compared to what
you have here. Since the "set" should be hoisted out of the outer
loop, and we already rely on this pattern for vector8_highbit_mask
anyway, it might be tolerable, and we can reduce the pain with bitwise
NOT.

+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */

v10 asserted this requirement -- that still seems like a good thing?
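The portability hazard behind that requirement can be shown in scalar terms. SSE2's _mm_packus_epi16 saturates each 16-bit lane to 0..255, while a Neon-style narrowing (assumed here) truncates to the low byte; the two only agree when the upper byte is already zero:

```c
#include <stdint.h>

/* Scalar models of the two pack behaviors (assumption: the Neon path
 * narrows by truncation, the SSE2 path by unsigned saturation). */
static uint8_t
pack_truncate(uint16_t x)               /* Neon-style: keep the low byte */
{
    return (uint8_t) x;
}

static uint8_t
pack_saturate(uint16_t x)               /* SSE2 _mm_packus_epi16-style */
{
    return (x > 0xFF) ? 0xFF : (uint8_t) x;
}
```

With a zero upper byte both agree, which is exactly what the NB comment (and the proposed assertion) guarantees.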


--
John Naylor
Amazon Web Services

Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

От
Nathan Bossart
Дата:
On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> + if (unlikely(!success))
> + i = 0;
> 
> This is after the main loop exits, and the cold path is literally one
> instruction, so the motivation is not apparent to me.

Removed.  I was thinking about smaller inputs when I added this, but it
probably makes little difference.

>> Yeah, the compiler refuses unless the value is an integer literal.  I
>> thought of using a switch statement to cover all the values used in-tree,
>> but I didn't like that, either.
> 
> Neither option is great, but I mildly lean towards keeping it internal
> with "switch" or whatever: By putting the burden of specifying shift
> amounts on separately named functions we run a risk of combinatorial
> explosion in function names.

Done.

> (Separately, now I'm wondering if we can do the same for
> vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> would allow getting rid of )

I think so.  I doubt there's any performance advantage, but it could be
nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.")  I'll
do this in the related patch in the "couple of small patches for simd.h"
thread.

> + * back together to form the final hex-encoded string.  It might be
> + * possible to squeeze out a little more gain by manually unrolling the
> + * loop, but for now we don't bother.
> 
> My position (and I think the community agrees) is that manual
> unrolling is a rare desperation move that has to be justified, so we
> don't need to mention its lack.

Removed.

> + * Some compilers are picky about casts to the same underlying type, and others
> + * are picky about implicit conversions with vector types.  This function does
> + * the same thing as vector32_broadcast(), but it returns a Vector8 and is
> + * carefully crafted to avoid compiler indigestion.
> + */
> +#ifndef USE_NO_SIMD
> +static inline Vector8
> +vector8_broadcast_u32(const uint32 c)
> +{
> +#ifdef USE_SSE2
> + return vector32_broadcast(c);
> +#elif defined(USE_NEON)
> + return (Vector8) vector32_broadcast(c);
> +#endif
> +}
> 
> I'm ambivalent about this: The use case doesn't seem well motivated,
> since I don't know why we'd actually need to both broadcast arbitrary
> integers and also view the result as bytes. Setting arbitrary bytes is
> what we're really doing, and would be more likely be useful in the
> future (attached, only tested on x86, and I think part of the
> strangeness is the endianness you mentioned above). On the other hand,
> the Arm workaround results in awful generated code compared to what
> you have here. Since the "set" should be hoisted out of the outer
> loop, and we already rely on this pattern for vector8_highbit_mask
> anyway, it might be tolerable, and we can reduce the pain with bitwise
> NOT.

I think I disagree on this one.  We're not broadcasting arbitrary bytes for
every vector element, we're broadcasting a pattern of bytes that happens to
be wider than the element size.  I would expect this to be a relatively
common use-case.  Furthermore, the "set" API is closely tethered to the
vector size, which is fine for SSE2/Neon but may not work down the road
(not to mention the USE_NO_SIMD path).  Also, the bitwise NOT approach
won't work because we need to use 0x0f000f00 and 0x000f000f to avoid
angering the assertion in vector8_pack_16(), as mentioned below.

> +/*
> + * Pack 16-bit elements in the given vectors into a single vector of 8-bit
> + * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
> + * this will produce different results on different architectures.
> + */
> 
> v10 asserted this requirement -- that still seems like a good thing?

I had removed that because I worried the accumulator approach would cause
it to fail (it does), but looking again, that's easy enough to work around.

-- 
nathan

Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From:
John Naylor
Date:
On Fri, Sep 26, 2025 at 1:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> > (Separately, now I'm wondering if we can do the same for
> > vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> > would allow getting rid of )
>
> I think so.  I doubt there's any performance advantage, but it could be
> nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
> (renamed to vector8_ussub() in the patch) after "getting rid of.")

Yes right, sorry. And it seems good to do such cleanup first, since it
doesn't make sense to rename something that is about to be deleted.

> I think I disagree on this one.  We're not broadcasting arbitrary bytes for
> every vector element, we're broadcasting a pattern of bytes that happens to
> be wider than the element size.  I would expect this to be a relatively
> common use-case.

That's probably true. I'm still worried that the hack for working
around compiler pickiness (while nice enough in its current form)
might break at some point and require awareness of compiler versions.

Hmm, for this case, we can sidestep the maintainability questions
entirely by instead using the new interleave functions to build the
masks:

vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f))
vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero())

This generates code identical to v12 on Arm and is not bad on x86.
What do you think of the attached?

While looking around again, it looks like the "msk" variable isn't a
mask like the name implies to me. Not sure of a better name because I'm
not sure what it represents aside from a temp variable.

+#elif defined(USE_NEON)
+ switch (i)
+ {
+ case 4:
+ return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+ case 8:
+ return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+ default:
+ pg_unreachable();
+ return vector8_broadcast(0);
+ }

This is just a compiler hint, so if the input is not handled I think
it will return the wrong answer rather than alerting the developer, so
we probably want "Assert(false)" here.

Other than that, the pack/unpack functions could use some
documentation about which parameter is low/high.

--
John Naylor
Amazon Web Services

Attachments

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From:
Nathan Bossart
Date:
On Mon, Sep 29, 2025 at 03:45:27PM +0700, John Naylor wrote:
> On Fri, Sep 26, 2025 at 1:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
>> > (Separately, now I'm wondering if we can do the same for
>> > vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
>> > would allow getting rid of )
>>
>> I think so.  I doubt there's any performance advantage, but it could be
>> nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
>> (renamed to vector8_ussub() in the patch) after "getting rid of.")
> 
> Yes right, sorry. And it seems good to do such cleanup first, since it
> doesn't make sense to rename something that is about to be deleted.

Will do.  I'll plan on committing the other patch [0] soon.

> Hmm, for this case, we can sidestep the maintainability questions
> entirely by instead using the new interleave functions to build the
> masks:
> 
> vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f))
> vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero())
> 
> This generates identical code as v12 on Arm and is not bad on x86.
> What do you think of the attached?

WFM

> While looking around again, it looks like the "msk" variable isn't a
> mask like the name implies to me. Not sure of a better name because I'm not
> sure what it represents aside from a temp variable.

Renamed to "tmp".

> +#elif defined(USE_NEON)
> + switch (i)
> + {
> + case 4:
> + return (Vector8) vshrq_n_u32((Vector32) v1, 4);
> + case 8:
> + return (Vector8) vshrq_n_u32((Vector32) v1, 8);
> + default:
> + pg_unreachable();
> + return vector8_broadcast(0);
> + }
> 
> This is just a compiler hint, so if the input is not handled I think
> it will return the wrong answer rather than alerting the developer, so
> we probably want "Assert(false)" here.

Fixed.

> Other than that, the pack/unpack functions could use some
> documentation about which parameter is low/high.

Added.

[0] https://postgr.es/m/attachment/182185/v3-0001-Optimize-vector8_has_le-on-AArch64.patch

-- 
nathan

Attachments