Discussion: Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Wed, Jan 22, 2025 at 11:10:10AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> I realized I didn't attach the patch.

Thanks.  Would you mind creating a commitfest entry for this one?

--
nathan
I have marked the commitfest entry for this [0] as waiting-on-author
because the patch needs to be rebased.

[0] https://commitfest.postgresql.org/patch/5538/

--
nathan
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
Here's the rebased patch with a few modifications.
The hand-unrolled hex encode performs better than the non-unrolled version on
r8g.4xlarge. No improvement on m7g.4xlarge.
Added line-by-line comments explaining the changes with an example.
Below are the results. Input size is in bytes, and exec time is in ms.
encode - r8g.4xlarge
Input | master | SVE | SVE-unrolled
-------+--------+--------+--------------
8 | 4.971 | 6.434 | 6.623
16 | 8.532 | 4.399 | 4.710
24 | 12.296 | 5.007 | 5.780
32 | 16.003 | 5.027 | 5.234
40 | 19.628 | 5.807 | 6.201
48 | 23.277 | 5.815 | 6.222
56 | 26.927 | 6.744 | 7.030
64 | 30.419 | 6.774 | 6.347
128 | 83.250 | 10.214 | 9.104
256 |112.158 | 17.892 | 16.313
512 |216.544 | 31.060 | 29.876
1024 |429.351 | 59.310 | 53.374
2048 |854.677 |116.769 | 101.004
4096 |1706.528|237.322 | 195.297
8192 |3723.884|499.520 | 385.424
---------------------------------------
encode - m7g.4xlarge
Input | master | SVE | SVE-unrolled
-------+--------+--------+--------------
8 | 5.503 | 7.986 | 8.053
16 | 9.881 | 9.583 | 9.888
24 | 13.854 | 9.212 | 10.138
32 | 18.056 | 9.208 | 9.364
40 | 22.127 | 10.134 | 10.540
48 | 26.214 | 10.186 | 10.550
56 | 29.718 | 10.197 | 10.428
64 | 33.613 | 10.982 | 10.497
128 | 66.060 | 12.460 | 12.624
256 |130.225 | 18.491 | 18.872
512 |267.105 | 30.343 | 31.661
1024 |515.603 | 54.371 | 55.341
2048 |1013.766|103.898 | 105.192
4096 |2018.705|202.653 | 203.142
8192 |4000.496|400.918 | 401.842
---------------------------------------
decode - r8g.4xlarge
Input | master | SVE
-------+--------+--------
8 | 7.641 | 8.787
16 | 14.301 | 14.477
32 | 28.663 | 6.091
48 | 42.940 | 17.604
64 | 57.483 | 10.549
80 | 71.637 | 19.194
96 | 85.918 | 15.586
112 |100.272 | 25.956
128 |114.740 | 19.829
256 |229.176 | 36.032
512 |458.295 | 68.222
1024 |916.741 |132.927
2048 |1833.422|262.741
4096 |3667.096|522.009
8192 |7333.886|1042.447
---------------------------------------
decode - m7g.4xlarge
Input | master | SVE
-------+--------+--------
8 | 8.194 | 9.433
16 | 14.397 | 15.606
32 | 26.669 | 29.006
48 | 45.971 | 48.984
64 | 58.468 | 12.388
80 | 70.820 | 22.295
96 | 84.792 | 43.470
112 | 98.992 | 54.282
128 |113.250 | 25.508
256 |218.743 | 45.165
512 |414.133 | 86.800
1024 |828.493 |174.670
2048 |1617.921|346.375
4096 |3259.159|689.391
8192 |6551.879|1376.195
--------
Chiranmoy
Attachments
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
Hi all,
Since the CommitFest is underway, could we get some feedback to improve the patch?
_______
Chiranmoy
On Wed, Sep 03, 2025 at 11:11:24AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> Since the CommitFest is underway, could we get some feedback to improve
> the patch?

I see that there was some discussion about a Neon implementation upthread,
but I'm not sure we concluded anything.  For popcount, we first added a
Neon version before adding the SVE version, which required more
complicated configure/runtime checks.  Presumably Neon is available on
more hardware than SVE, so that could be a good place to start here, too.

Also, I'd strongly encourage you to get involved with others' patches on
the mailing lists (e.g., reviewing, testing).  Patch submissions are
great, but this community depends on other types of participation, too.
IME helping others with their patches also tends to incentivize others to
help with yours.

--
nathan
On Wed, Sep 3, 2025 at 6:11 PM Chiranmoy.Bhattacharya@fujitsu.com
<Chiranmoy.Bhattacharya@fujitsu.com> wrote:
>
> Hi all,
>
> Since the CommitFest is underway, could we get some feedback to improve the patch?

On that note, I was hoping you could give us feedback on whether the
improvement in PG18 made any difference at all in your real-world
use-case, i.e. not just in a microbenchmark, but also including
transmission of the hex-encoded values across the network to the client
(that I assume must decode them again).

--
John Naylor
Amazon Web Services
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> I see that there was some discussion about a Neon implementation upthread,
> but I'm not sure we concluded anything. For popcount, we first added a
> Neon version before adding the SVE version, which required more complicated
> configure/runtime checks. Presumably Neon is available on more hardware
> than SVE, so that could be a good place to start here, too.
We have added the Neon versions of hex encode/decode.
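For readers following along, the lookup-based Neon encode has roughly this
shape (a simplified sketch assuming 16-byte chunks and a scalar tail;
hex_encode_neon_sketch and its exact structure are illustrative, not the
code in the attached patch):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

static void
hex_encode_neon_sketch(const uint8_t *src, size_t len, uint8_t *dst)
{
	/* table of hex digits indexed by nibble value 0..15 */
	const uint8x16_t lookup = vld1q_u8((const uint8_t *) "0123456789abcdef");
	size_t		i;

	for (i = 0; i + 16 <= len; i += 16)
	{
		uint8x16_t	bytes = vld1q_u8(src + i);
		uint8x16x2_t out;

		/* high nibble becomes the first digit, low nibble the second */
		out.val[0] = vqtbl1q_u8(lookup, vshrq_n_u8(bytes, 4));
		out.val[1] = vqtbl1q_u8(lookup, vandq_u8(bytes, vdupq_n_u8(0x0f)));

		/* interleaving store writes the digit pairs in order */
		vst2q_u8(dst + i * 2, out);
	}

	/* the remaining len - i bytes would fall through to a scalar loop */
}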
Here are the microbenchmark numbers.
hex_encode - m7g.4xlarge
Input | Head | Neon
-------+--------+--------
32 | 18.056 | 5.957
40 | 22.127 | 10.205
48 | 26.214 | 14.151
64 | 33.613 | 6.164
128 | 66.060 | 11.372
256 |130.225 | 18.543
512 |267.105 | 33.977
1024 |515.603 | 64.462
hex_decode - m7g.4xlarge
Input | Head | Neon
-------+--------+--------
32 | 26.669 | 9.462
40 | 36.320 | 19.347
48 | 45.971 | 19.099
64 | 58.468 | 17.648
128 |113.250 | 30.437
256 |218.743 | 56.824
512 |414.133 |107.212
1024 |828.493 |210.740
> Also, I'd strongly encourage you to get involved with others' patches on
> the mailing lists (e.g., reviewing, testing). Patch submissions are great,
> but this community depends on other types of participation, too. IME
> helping others with their patches also tends to incentivize others to help
> with yours.
Sure, we will try to test/review patches in areas where we have experience.
> On that note, I was hoping you could give us feedback on whether the
> improvement in PG18 made any difference at all in your real-world
> use-case, i.e. not just in a microbenchmark, but also including
> transmission of the hex-encoded values across the network to the
> client (that I assume must decode them again).
Yes, the improvement in v18 did help; see the attached perf graphs.
We used a Python script to send and receive binary data from Postgres.
For simple SELECT queries on a bytea column, hex_encode was taking
42% of the query execution time in v17; this was reduced to 33% in v18,
resulting in around an 18% improvement in overall query time.
The proposed patch further reduces the hex_encode function usage to
5.6%, another 25% improvement in total query time.
We observed similar improvements for INSERT queries on the bytea column.
hex_decode usage decreased from 15.5% to 5.5%, a 5-8% query-level
improvement depending on which storage type is used.
------
Chiranmoy
Attachments
On Thu, Sep 04, 2025 at 02:55:50PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> I see that there was some discussion about a Neon implementation upthread,
>> but I'm not sure we concluded anything.  For popcount, we first added a
>> Neon version before adding the SVE version, which required more complicated
>> configure/runtime checks.  Presumably Neon is available on more hardware
>> than SVE, so that could be a good place to start here, too.
>
> We have added the Neon versions of hex encode/decode.

Thanks.  I noticed that this stuff is simple enough that we can use
port/simd.h (with a few added functions).  This is especially nice because
it takes care of x86, too.  The performance gains look similar to what you
reported for v6:

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    13 |     6 |     54
    64 |    34 |     9 |     74
   256 |    93 |    25 |     73
  1024 |   281 |    78 |     72
  4096 |  1086 |   227 |     79
 16384 |  4382 |   927 |     79
 65536 | 17455 |  3608 |     79

x86
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    10 |     7 |     30
    64 |    29 |     9 |     69
   256 |    81 |    21 |     74
  1024 |   286 |    66 |     77
  4096 |  1106 |   253 |     77
 16384 |  4383 |   980 |     78
 65536 | 17491 |  3886 |     78

I've only modified hex_encode() for now, but I'm optimistic that we can do
something similar for hex_decode().

--
nathan
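For illustration, the "compare and add" step that bare SSE2 supports might
look like the following sketch (hypothetical function name; each input
byte is assumed to already hold a nibble value 0..15):

#include <emmintrin.h>		/* SSE2 */

/* Sketch: map 16 nibble values to ASCII hex digits without a lookup table. */
static inline __m128i
nibbles_to_hex_sse2(__m128i nib)
{
	/* nibbles above 9 need base 'a' - 10 instead of '0' */
	__m128i		gt9 = _mm_cmpgt_epi8(nib, _mm_set1_epi8(9));
	__m128i		adj = _mm_and_si128(gt9, _mm_set1_epi8('a' - '0' - 10));

	return _mm_add_epi8(_mm_add_epi8(nib, _mm_set1_epi8('0')), adj);
}

The signed comparison is safe here because nibble values fit in 0..15.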
Attachments
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> Thanks. I noticed that this stuff is simple enough that we can use
> port/simd.h (with a few added functions). This is especially nice because
> it takes care of x86, too. The performance gains look similar to what you
> reported for v6:
This looks good, much cleaner.
One possible improvement would be to use a vectorized table lookup instead of compare and add. I compared v6 and v7 Neon versions, and v6 is always faster.
I’m not sure if SSE2 has a table lookup similar to Neon.
arm - m7g.4xlarge
buf | v6-Neon| v7-Neon| % diff
-------+--------+--------+--------
64 | 6.16 | 8.57 | 28.07
128 | 11.37 | 15.77 | 27.87
256 | 18.54 | 30.28 | 38.77
512 | 33.98 | 62.15 | 45.33
1024 | 64.46 | 117.55 | 45.16
2048 | 124.28 | 254.86 | 51.24
4096 | 243.47 | 509.23 | 52.19
8192 | 487.34 | 953.81 | 48.91
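(On the SSE2 question: plain SSE2 indeed has no byte-table lookup, but
SSSE3's _mm_shuffle_epi8 behaves much like Neon's vqtbl1q_u8 -- presumably
the instruction behind the [0] reference later in the thread.  A
hypothetical sketch:)

#include <tmmintrin.h>		/* SSSE3, not SSE2 */

/* Sketch: SSSE3 analogue of the Neon nibble-to-hex-digit table lookup. */
static inline __m128i
nibbles_to_hex_ssse3(__m128i nib)
{
	const __m128i lookup = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
										 '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');

	/* each byte of nib (0..15) selects the matching hex digit */
	return _mm_shuffle_epi8(lookup, nib);
}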
-----
Chiranmoy
On Thu, Sep 11, 2025 at 10:43:56AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> One possible improvement would be to use a vectorized table lookup
> instead of compare and add. I compared v6 and v7 Neon versions, and v6 is
> always faster. I'm not sure if SSE2 has a table lookup similar to Neon.

I'm not finding a simple way to do that kind of table lookup in SSE2.

Part of the reason v6 performs better is because you've unrolled the loop
to process two vectors' worth of input data in each iteration.  This
trades performance with smaller inputs for gains with larger ones.  But
even if I do something similar for v7, v6 still wins most of the time.

My current philosophy with this stuff is to favor simplicity,
maintainability, portability, etc. over extracting the absolute maximum
amount of performance gain, so I think we should proceed with the simd.h
approach.  But I'm curious how others feel about this.

v8 is an attempt to fix the casting error on MSVC.

--
nathan
Attachments
Nathan Bossart <nathandbossart@gmail.com> writes:
> My current philosophy with this stuff is to favor simplicity,
> maintainability, portability, etc. over extracting the absolute maximum
> amount of performance gain, so I think we should proceed with the simd.h
> approach.  But I'm curious how others feel about this.

+1.  The maintainability aspect is critical over the long run.
Also, there's a very real danger of optimizing for the specific
hardware and test case you are working with, leading to actually
worse performance with future hardware.

			regards, tom lane
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> My current philosophy with this stuff is to favor simplicity,
> maintainability, portability, etc. over extracting the absolute maximum
> amount of performance gain, so I think we should proceed with the simd.h
> approach. But I'm curious how others feel about this.
> +1. The maintainability aspect is critical over the long run.
> Also, there's a very real danger of optimizing for the specific
> hardware and test case you are working with, leading to actually
> worse performance with future hardware.
Using simd.h does make it easier to maintain.
Is there a plan to upgrade simd.h to use SSE4 or SSSE3 in the future?
Since SSE2 is much older, it lacks some of the more specialized intrinsics.
For example, vectorized table lookup can be implemented via [0], and
it’s available in SSSE3 and later x86 instruction sets.
-----
Chiranmoy
On Fri, Sep 12, 2025 at 06:49:01PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> Using simd.h does make it easier to maintain.  Is there a plan to upgrade
> simd.h to use SSE4 or SSSE3 in the future?  Since SSE2 is much older, it
> lacks some of the more specialized intrinsics.  For example, vectorized
> table lookup can be implemented via [0], and it's available in SSSE3 and
> later x86 instruction sets.

There have been a couple of discussions about the possibility of requiring
x86-64-v2 for Postgres, but I'm not aware of any serious efforts in that
area.

I've attached a new version of the patch with a simd.h version of
hex_decode().  Here are the numbers:

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    22 |    23 |     -5
    64 |    61 |    23 |     62
   256 |   158 |    47 |     70
  1024 |   542 |   122 |     77
  4096 |  2103 |   429 |     80
 16384 |  8548 |  1673 |     80
 65536 | 34663 |  6738 |     81

x86
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    13 |    14 |     -8
    64 |    42 |    15 |     64
   256 |   126 |    42 |     67
  1024 |   461 |   149 |     68
  4096 |  1802 |   576 |     68
 16384 |  7166 |  2280 |     68
 65536 | 28625 |  9108 |     68

A couple of notes:

* For hex_decode(), we just give up on the SIMD path and fall back on the
  scalar path as soon as we see anything outside [0-9A-Fa-f].  I suspect
  this might introduce a regression for inputs of ~32 to ~64 bytes that
  include whitespace (which must be skipped) or invalid characters, but I
  don't know whether those inputs are common or whether we care.

* The code makes some assumptions about endianness that might not be true
  everywhere, but I've yet to dig into this further.

--
nathan
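To make the fallback condition concrete, here is a scalar model of the
per-character mapping that the SIMD path has to validate (hex_nibble is an
illustrative helper, not a function in the patch):

#include <stdbool.h>
#include <stdint.h>

static inline bool
hex_nibble(uint8_t c, uint8_t *out)
{
	if (c >= '0' && c <= '9')
		*out = c - '0';
	else if ((c | 0x20) >= 'a' && (c | 0x20) <= 'f')
		*out = (c | 0x20) - 'a' + 10;
	else
		return false;			/* whitespace or invalid: SIMD path gives up */
	return true;
}

Anything that takes the "false" branch anywhere in a chunk sends the whole
input through the scalar path.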
Attachments
On Fri, Sep 12, 2025 at 04:30:21PM -0500, Nathan Bossart wrote:
> I've attached a new version of the patch with a simd.h version of
> hex_decode().  Here are the numbers:

I was able to improve the hex_decode() implementation a bit.

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    11 |    11 |      0
    64 |    38 |     7 |     82
   256 |   133 |    18 |     86
  1024 |   513 |    67 |     87
  4096 |  2037 |   271 |     87
 16384 |  8326 |  1103 |     87
 65536 | 34550 |  4475 |     87

x86
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |     8 |     9 |    -13
    64 |    38 |     7 |     82
   256 |   121 |    24 |     80
  1024 |   457 |    91 |     80
  4096 |  1797 |   356 |     80
 16384 |  7161 |  1411 |     80
 65536 | 28620 |  5632 |     80

--
nathan
Attachments
On Mon, Sep 22, 2025 at 03:05:44PM -0500, Nathan Bossart wrote:
> I was able to improve the hex_decode() implementation a bit.

I took a closer look at how hex_decode() performs with smaller inputs.
There are some small regressions, so I tried fixing them by adding the
following to the beginning of the function:

	if (likely(tail_idx == 0))
		return hex_decode_safe_scalar(src, len, dst, escontext);

This helped a little, but it mostly just slowed things down for larger
inputs on AArch64:

arm
   buf |  HEAD | patch |   fix
-------+-------+-------+-------
     2 |     4 |     6 |     4
     4 |     6 |     7 |     7
     8 |     8 |     8 |     8
    16 |    11 |    12 |    11
    32 |    18 |     5 |     6
    64 |    38 |     7 |     8
   256 |   134 |    18 |    24
  1024 |   514 |    67 |   100
  4096 |  2072 |   280 |   389
 16384 |  8409 |  1126 |  1537
 65536 | 34704 |  4498 |  6128

x86
   buf |  HEAD | patch |   fix
-------+-------+-------+-------
     2 |     2 |     2 |     2
     4 |     3 |     3 |     3
     8 |     4 |     4 |     4
    16 |     8 |     9 |     8
    32 |    23 |     5 |     5
    64 |    37 |     7 |     7
   256 |   122 |    24 |    24
  1024 |   457 |    91 |    92
  4096 |  1798 |   357 |   358
 16384 |  7161 |  1411 |  1416
 65536 | 28621 |  5630 |  5653

I didn't do this test for hex_encode(), but I'd expect it to follow a
similar pattern.  I'm tempted to suggest that these regressions are within
tolerable levels and to forge on with v10.

In any case, IMHO this patch is approaching committable quality, so I'd be
grateful for any feedback.

--
nathan
On Wed, Sep 24, 2025 at 2:02 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Sep 22, 2025 at 03:05:44PM -0500, Nathan Bossart wrote:
> > I was able to improve the hex_decode() implementation a bit.
>
> I took a closer look at how hex_decode() performs with smaller inputs.
> There are some small regressions, so I tried fixing them by adding the
> following to the beginning of the function:
>
>     if (likely(tail_idx == 0))
>         return hex_decode_safe_scalar(src, len, dst, escontext);
>
> This helped a little, but it mostly just slowed things down for larger
> inputs on AArch64:

> I didn't do this test for hex_encode(), but I'd expect it to follow a
> similar pattern.  I'm tempted to suggest that these regressions are
> within tolerable levels and to forge on with v10.

My first thought is, I'd hazard a guess that short byteas are much less
common than short strings.  My second thought is, the decode case is not
that critical.  From the end-to-end tests above, the speed of the decode
case had a relatively small global effect compared to the encode case
(perhaps because reads are cheaper than writes).

+   if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
+       break;

But if you really want to do something here, sprinkling "(un)likely"'s
here seems like solving the wrong problem (even if they make any
difference), since the early return is optimizing for exceptional
conditions.  In other places (cf. the UTF8 string verifier), we accumulate
errors, and only if we have them at the end do we restart from the
beginning with the slow error-checking path that can show the user the
offending input.

> In any case, IMHO this patch is approaching committable quality, so I'd
> be grateful for any feedback.

+vector8_sssub(const Vector8 v1, const Vector8 v2)

It's hard to parse "sss", so maybe we can borrow an Intel-ism and use
"iss" for the signed case?

+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_byte(const Vector32 v1);

Do we need declarations for these?  I recall that the existing
declarations are there for functions that are also used internally.

The nibble/byte things are rather specific.  Wouldn't it be more logical
to expose the already-generic shift operations and let the caller say by
how much?  Or does the compiler refuse because the intrinsic doesn't get
an immediate value?  Some are like that, but I'm not sure about these.  If
so, that's annoying and I wonder if there's a workaround.

+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+   Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
+   Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
+
+   return vector8_is_highbit_set(cmpe);

We take pains to avoid signed comparison on unsigned input for the "le"
case, and I don't see why it's okay here.

Do the regression tests have long enough cases that test exceptional
paths, like invalid bytes and embedded whitespace?  If not, we need some.

--
John Naylor
Amazon Web Services
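In outline, the accumulator pattern described above might look like this
(a sketch in the patch's simd.h vocabulary; chunk_invalid_mask() is a
hypothetical stand-in for the real per-chunk validity check, and the
decode/store step is elided):

/* Sketch only: accumulate error bits, check once after the loop. */
static uint64
hex_decode_outline(const char *src, size_t len, char *dst, Node *escontext)
{
	Vector8		error = vector8_broadcast(0);
	size_t		i;

	for (i = 0; i + sizeof(Vector8) <= len; i += sizeof(Vector8))
	{
		Vector8		chunk;

		vector8_load(&chunk, (const uint8 *) src + i);

		/* remember any bad byte, but keep decoding optimistically */
		error = vector8_or(error, chunk_invalid_mask(chunk));

		/* ... decode chunk and write to dst ... */
	}

	/* one check at the end; redo on the slow path to report the offender */
	if (vector8_is_highbit_set(error))
		return hex_decode_safe_scalar(src, len, dst, escontext);

	/* ... handle the remaining bytes with the scalar path ... */
}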
On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
> +   if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
> +       break;
>
> But if you really want to do something here, sprinkling "(un)likely"'s
> here seems like solving the wrong problem (even if they make any
> difference), since the early return is optimizing for exceptional
> conditions.  In other places (cf. the UTF8 string verifier), we
> accumulate errors, and only if we have them at the end do we restart
> from the beginning with the slow error-checking path that can show the
> user the offending input.

I switched to an accumulator approach in v11.

> +vector8_sssub(const Vector8 v1, const Vector8 v2)
>
> It's hard to parse "sss", so maybe we can borrow an Intel-ism and use
> "iss" for the signed case?

Done.

> +/* vector manipulation */
> +#ifndef USE_NO_SIMD
> +static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
> +static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
> +static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
> +static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
> +static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
> +static inline Vector32 vector32_shift_right_byte(const Vector32 v1);
>
> Do we need declarations for these?  I recall that the existing
> declarations are there for functions that are also used internally.

Removed.

> The nibble/byte things are rather specific.  Wouldn't it be more
> logical to expose the already-generic shift operations and let the
> caller say by how much?  Or does the compiler refuse because the
> intrinsic doesn't get an immediate value?  Some are like that, but I'm
> not sure about these.  If so, that's annoying and I wonder if there's a
> workaround.

Yeah, the compiler refuses unless the value is an integer literal.  I
thought of using a switch statement to cover all the values used in-tree,
but I didn't like that, either.

> +vector8_has_ge(const Vector8 v, const uint8 c)
> +{
> +#ifdef USE_SSE2
> +   Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
> +   Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
> +
> +   return vector8_is_highbit_set(cmpe);
>
> We take pains to avoid signed comparison on unsigned input for the
> "le" case, and I don't see why it's okay here.

_mm_max_epu8() does unsigned comparisons, I think...

> Do the regression tests have long enough cases that test exceptional
> paths, like invalid bytes and embedded whitespace?  If not, we need
> some.

Added.

I've also fixed builds on gcc/arm64, as discussed elsewhere [0].  Here are
the current numbers on my laptop:

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
     2 |     4 |     4 |      0
     4 |     6 |     6 |      0
     8 |     8 |     8 |      0
    16 |    11 |    12 |     -9
    32 |    18 |     5 |     72
    64 |    38 |     6 |     84
   256 |   134 |    17 |     87
  1024 |   513 |    63 |     88
  4096 |  2081 |   262 |     87
 16384 |  8524 |  1058 |     88
 65536 | 34731 |  4224 |     88

[0] https://postgr.es/m/aNQtN89VW8z-yo3B%40nathan

--
nathan
Attachments
On Thu, Sep 25, 2025 at 4:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
> > + if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
> > +     break;
> >
> > But if you really want to do something here, sprinkling "(un)likely"'s
> > here seems like solving the wrong problem (even if they make any
> > difference), since the early return is optimizing for exceptional
> > conditions.  In other places (cf. the UTF8 string verifier), we
> > accumulate errors, and only if we have them at the end do we restart
> > from the beginning with the slow error-checking path that can show the
> > user the offending input.
>
> I switched to an accumulator approach in v11.

Looks good to me.

+   if (unlikely(!success))
+       i = 0;

This is after the main loop exits, and the cold path is literally one
instruction, so the motivation is not apparent to me.

> > The nibble/byte things are rather specific.  Wouldn't it be more
> > logical to expose the already-generic shift operations and let the
> > caller say by how much?  Or does the compiler refuse because the
> > intrinsic doesn't get an immediate value?  Some are like that, but I'm
> > not sure about these.  If so, that's annoying and I wonder if there's a
> > workaround.
>
> Yeah, the compiler refuses unless the value is an integer literal.  I
> thought of using a switch statement to cover all the values used in-tree,
> but I didn't like that, either.

Neither option is great, but I mildly lean towards keeping it internal
with "switch" or whatever: By putting the burden of specifying shift
amounts on separately named functions we run a risk of combinatorial
explosion in function names.  If you feel otherwise, I'd at least use
actual numbers: "shift_left_nibble" is an awkward way to say "shift left
by 4 bits" anyway, and also after "byte" and "nibble" there are not many
good English words to convey the operand amount.  It's very possible that
needing other shift amounts will never come up, though.

> > +vector8_has_ge(const Vector8 v, const uint8 c)
> > +{
> > +#ifdef USE_SSE2
> > +   Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
> > +   Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
> > +
> > +   return vector8_is_highbit_set(cmpe);
> >
> > We take pains to avoid signed comparison on unsigned input for the
> > "le" case, and I don't see why it's okay here.
>
> _mm_max_epu8() does unsigned comparisons, I think...

Ah, I confused myself about what the LE case was avoiding, namely signed
LE, not signed equality on something else.  (Separately, now I'm wondering
if we can do the same for vector8_has_le since _mm_min_epu8 and vminvq_u8
both exist, and that would allow getting rid of )

> > Do the regression tests have long enough cases that test exceptional
> > paths, like invalid bytes and embedded whitespace?  If not, we need
> > some.
>
> Added.

Seems comprehensive enough at a glance.

Other comments:

+ * back together to form the final hex-encoded string.  It might be
+ * possible to squeeze out a little more gain by manually unrolling the
+ * loop, but for now we don't bother.

My position (and I think the community agrees) is that manual unrolling is
a rare desperation move that has to be justified, so we don't need to
mention its lack.

+ * Some compilers are picky about casts to the same underlying type, and others
+ * are picky about implicit conversions with vector types.  This function does
+ * the same thing as vector32_broadcast(), but it returns a Vector8 and is
+ * carefully crafted to avoid compiler indigestion.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_broadcast_u32(const uint32 c)
+{
+#ifdef USE_SSE2
+   return vector32_broadcast(c);
+#elif defined(USE_NEON)
+   return (Vector8) vector32_broadcast(c);
+#endif
+}

I'm ambivalent about this: The use case doesn't seem well motivated, since
I don't know why we'd actually need to both broadcast arbitrary integers
and also view the result as bytes.  Setting arbitrary bytes is what we're
really doing, and would more likely be useful in the future (attached,
only tested on x86, and I think part of the strangeness is the endianness
you mentioned above).  On the other hand, the Arm workaround results in
awful generated code compared to what you have here.  Since the "set"
should be hoisted out of the outer loop, and we already rely on this
pattern for vector8_highbit_mask anyway, it might be tolerable, and we can
reduce the pain with bitwise NOT.

+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */

v10 asserted this requirement -- that still seems like a good thing?

--
John Naylor
Amazon Web Services
Attachments
On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> +   if (unlikely(!success))
> +       i = 0;
>
> This is after the main loop exits, and the cold path is literally one
> instruction, so the motivation is not apparent to me.

Removed.  I was thinking about smaller inputs when I added this, but it
probably makes little difference.

>> Yeah, the compiler refuses unless the value is an integer literal.  I
>> thought of using a switch statement to cover all the values used in-tree,
>> but I didn't like that, either.
>
> Neither option is great, but I mildly lean towards keeping it internal
> with "switch" or whatever: By putting the burden of specifying shift
> amounts on separately named functions we run a risk of combinatorial
> explosion in function names.

Done.

> (Separately, now I'm wondering if we can do the same for
> vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> would allow getting rid of )

I think so.  I doubt there's any performance advantage, but it could be
nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.")  I'll
do this in the related patch in the "couple of small patches for simd.h"
thread.

> + * back together to form the final hex-encoded string.  It might be
> + * possible to squeeze out a little more gain by manually unrolling the
> + * loop, but for now we don't bother.
>
> My position (and I think the community agrees) is that manual
> unrolling is a rare desperation move that has to be justified, so we
> don't need to mention its lack.

Removed.

> + * Some compilers are picky about casts to the same underlying type, and others
> + * are picky about implicit conversions with vector types.  This function does
> + * the same thing as vector32_broadcast(), but it returns a Vector8 and is
> + * carefully crafted to avoid compiler indigestion.
> + */
> +#ifndef USE_NO_SIMD
> +static inline Vector8
> +vector8_broadcast_u32(const uint32 c)
> +{
> +#ifdef USE_SSE2
> +   return vector32_broadcast(c);
> +#elif defined(USE_NEON)
> +   return (Vector8) vector32_broadcast(c);
> +#endif
> +}
>
> I'm ambivalent about this: The use case doesn't seem well motivated,
> since I don't know why we'd actually need to both broadcast arbitrary
> integers and also view the result as bytes.  Setting arbitrary bytes is
> what we're really doing, and would more likely be useful in the
> future (attached, only tested on x86, and I think part of the
> strangeness is the endianness you mentioned above).  On the other hand,
> the Arm workaround results in awful generated code compared to what
> you have here.  Since the "set" should be hoisted out of the outer
> loop, and we already rely on this pattern for vector8_highbit_mask
> anyway, it might be tolerable, and we can reduce the pain with bitwise
> NOT.

I think I disagree on this one.  We're not broadcasting arbitrary bytes
for every vector element, we're broadcasting a pattern of bytes that
happens to be wider than the element size.  I would expect this to be a
relatively common use-case.  Furthermore, the "set" API is closely
tethered to the vector size, which is fine for SSE2/Neon but may not work
down the road (not to mention the USE_NO_SIMD path).  Also, the bitwise
NOT approach won't work because we need to use 0x0f000f00 and 0x000f000f
to avoid angering the assertion in vector8_pack_16(), as mentioned below.

> +/*
> + * Pack 16-bit elements in the given vectors into a single vector of 8-bit
> + * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
> + * this will produce different results on different architectures.
> + */
>
> v10 asserted this requirement -- that still seems like a good thing?

I had removed that because I worried the accumulator approach would cause
it to fail (it does), but looking again, that's easy enough to work
around.

--
nathan
Attachments
On Fri, Sep 26, 2025 at 1:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> > (Separately, now I'm wondering if we can do the same for
> > vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> > would allow getting rid of )
>
> I think so.  I doubt there's any performance advantage, but it could be
> nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
> (renamed to vector8_ussub() in the patch) after "getting rid of.")

Yes right, sorry.  And it seems good to do such cleanup first, since it
doesn't make sense to rename something that is about to be deleted.

> I think I disagree on this one.  We're not broadcasting arbitrary bytes
> for every vector element, we're broadcasting a pattern of bytes that
> happens to be wider than the element size.  I would expect this to be a
> relatively common use-case.

That's probably true.  I'm still worried that the hack for working around
compiler pickiness (while nice enough in its current form) might break at
some point and require awareness of compiler versions.

Hmm, for this case, we can sidestep the maintainability questions entirely
by instead using the new interleave functions to build the masks:

vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f))
vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero())

This generates code identical to v12 on Arm and is not bad on x86.  What
do you think of the attached?

While looking around again, it looks like the "msk" variable isn't a mask
like the name implies.  Not sure of a better name because I'm not sure
what it represents aside from a temp variable.

+#elif defined(USE_NEON)
+   switch (i)
+   {
+       case 4:
+           return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+       case 8:
+           return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+       default:
+           pg_unreachable();
+           return vector8_broadcast(0);
+   }

This is just a compiler hint, so if the input is not handled I think it
will return the wrong answer rather than alerting the developer, so we
probably want "Assert(false)" here.

Other than that, the pack/unpack functions could use some documentation
about which parameter is low/high.

--
John Naylor
Amazon Web Services
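As a scalar model of why the interleave trick produces the right masks
(demo code, not part of the patch):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint8_t		mask[16];

	/* models vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f)) */
	for (int i = 0; i < 8; i++)
	{
		mask[2 * i] = 0x00;		/* low halves come alternately from each input */
		mask[2 * i + 1] = 0x0f;
	}

	/* prints 000f repeated: 0x0f00 in each little-endian 16-bit lane */
	for (int i = 0; i < 16; i++)
		printf("%02x", mask[i]);
	putchar('\n');
	return 0;
}

Swapping the arguments yields the complementary 0x000f pattern.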
Attachments
On Mon, Sep 29, 2025 at 03:45:27PM +0700, John Naylor wrote:
> On Fri, Sep 26, 2025 at 1:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
>> > (Separately, now I'm wondering if we can do the same for
>> > vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
>> > would allow getting rid of )
>>
>> I think so.  I doubt there's any performance advantage, but it could be
>> nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
>> (renamed to vector8_ussub() in the patch) after "getting rid of.")
>
> Yes right, sorry.  And it seems good to do such cleanup first, since it
> doesn't make sense to rename something that is about to be deleted.

Will do.  I'll plan on committing the other patch [0] soon.

> Hmm, for this case, we can sidestep the maintainability questions
> entirely by instead using the new interleave functions to build the
> masks:
>
> vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f))
> vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero())
>
> This generates code identical to v12 on Arm and is not bad on x86.
> What do you think of the attached?

WFM

> While looking around again, it looks like the "msk" variable isn't a
> mask like the name implies.  Not sure of a better name because I'm not
> sure what it represents aside from a temp variable.

Renamed to "tmp".

> +#elif defined(USE_NEON)
> +   switch (i)
> +   {
> +       case 4:
> +           return (Vector8) vshrq_n_u32((Vector32) v1, 4);
> +       case 8:
> +           return (Vector8) vshrq_n_u32((Vector32) v1, 8);
> +       default:
> +           pg_unreachable();
> +           return vector8_broadcast(0);
> +   }
>
> This is just a compiler hint, so if the input is not handled I think
> it will return the wrong answer rather than alerting the developer, so
> we probably want "Assert(false)" here.

Fixed.

> Other than that, the pack/unpack functions could use some
> documentation about which parameter is low/high.

Added.

[0] https://postgr.es/m/attachment/182185/v3-0001-Optimize-vector8_has_le-on-AArch64.patch

--
nathan