Discussion: Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Wed, Jan 22, 2025 at 11:10:10AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> I realized I didn't attach the patch.

Thanks.  Would you mind creating a commitfest entry for this one?

--
nathan
I have marked the commitfest entry for this [0] as waiting-on-author
because the patch needs to be rebased.

[0] https://commitfest.postgresql.org/patch/5538/

--
nathan
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
Here's the rebased patch with a few modifications.
The hand-unrolled hex encode performs better than the non-unrolled version on
r8g.4xlarge. No improvement on m7g.4xlarge.
Added line-by-line comments explaining the changes with an example.
Below are the results. Input size is in bytes, and exec time is in ms.
encode - r8g.4xlarge
Input | master | SVE | SVE-unrolled
-------+--------+--------+--------------
8 | 4.971 | 6.434 | 6.623
16 | 8.532 | 4.399 | 4.710
24 | 12.296 | 5.007 | 5.780
32 | 16.003 | 5.027 | 5.234
40 | 19.628 | 5.807 | 6.201
48 | 23.277 | 5.815 | 6.222
56 | 26.927 | 6.744 | 7.030
64 | 30.419 | 6.774 | 6.347
128 | 83.250 | 10.214 | 9.104
256 |112.158 | 17.892 | 16.313
512 |216.544 | 31.060 | 29.876
1024 |429.351 | 59.310 | 53.374
2048 |854.677 |116.769 | 101.004
4096 |1706.528|237.322 | 195.297
8192 |3723.884|499.520 | 385.424
---------------------------------------
encode - m7g.4xlarge
Input | master | SVE | SVE-unrolled
-------+--------+--------+--------------
8 | 5.503 | 7.986 | 8.053
16 | 9.881 | 9.583 | 9.888
24 | 13.854 | 9.212 | 10.138
32 | 18.056 | 9.208 | 9.364
40 | 22.127 | 10.134 | 10.540
48 | 26.214 | 10.186 | 10.550
56 | 29.718 | 10.197 | 10.428
64 | 33.613 | 10.982 | 10.497
128 | 66.060 | 12.460 | 12.624
256 |130.225 | 18.491 | 18.872
512 |267.105 | 30.343 | 31.661
1024 |515.603 | 54.371 | 55.341
2048 |1013.766|103.898 | 105.192
4096 |2018.705|202.653 | 203.142
8192 |4000.496|400.918 | 401.842
---------------------------------------
decode - r8g.4xlarge
Input | master | SVE
-------+--------+--------
8 | 7.641 | 8.787
16 | 14.301 | 14.477
32 | 28.663 | 6.091
48 | 42.940 | 17.604
64 | 57.483 | 10.549
80 | 71.637 | 19.194
96 | 85.918 | 15.586
112 |100.272 | 25.956
128 |114.740 | 19.829
256 |229.176 | 36.032
512 |458.295 | 68.222
1024 |916.741 |132.927
2048 |1833.422|262.741
4096 |3667.096|522.009
8192 |7333.886|1042.447
---------------------------------------
decode - m7g.4xlarge
Input | master | SVE
-------+--------+--------
8 | 8.194 | 9.433
16 | 14.397 | 15.606
32 | 26.669 | 29.006
48 | 45.971 | 48.984
64 | 58.468 | 12.388
80 | 70.820 | 22.295
96 | 84.792 | 43.470
112 | 98.992 | 54.282
128 |113.250 | 25.508
256 |218.743 | 45.165
512 |414.133 | 86.800
1024 |828.493 |174.670
2048 |1617.921|346.375
4096 |3259.159|689.391
8192 |6551.879|1376.195
--------
Chiranmoy
Attachments
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
Hi all,
Since the CommitFest is underway, could we get some feedback to improve the patch?
_______
Chiranmoy
On Wed, Sep 03, 2025 at 11:11:24AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> Since the CommitFest is underway, could we get some feedback to improve
> the patch?

I see that there was some discussion about a Neon implementation upthread,
but I'm not sure we concluded anything.  For popcount, we first added a
Neon version before adding the SVE version, which required more
complicated configure/runtime checks.  Presumably Neon is available on
more hardware than SVE, so that could be a good place to start here, too.

Also, I'd strongly encourage you to get involved with others' patches on
the mailing lists (e.g., reviewing, testing).  Patch submissions are
great, but this community depends on other types of participation, too.
IME helping others with their patches also tends to incentivize others to
help with yours.

--
nathan
On Wed, Sep 3, 2025 at 6:11 PM Chiranmoy.Bhattacharya@fujitsu.com
<Chiranmoy.Bhattacharya@fujitsu.com> wrote:
>
> Hi all,
>
> Since the CommitFest is underway, could we get some feedback to improve the patch?

On that note, I was hoping you could give us feedback on whether the
improvement in PG18 made any difference at all in your real-world
use-case, i.e. not just in a microbenchmark, but also including
transmission of the hex-encoded values across the network to the client
(that I assume must decode them again).

--
John Naylor
Amazon Web Services
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> I see that there was some discussion about a Neon implementation upthread,
> but I'm not sure we concluded anything. For popcount, we first added a
> Neon version before adding the SVE version, which required more complicated
> configure/runtime checks. Presumably Neon is available on more hardware
> than SVE, so that could be a good place to start here, too.
We have added the Neon versions of hex encode/decode.
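For readers following along, the lookup-based Neon encode has roughly this
shape (a simplified sketch assuming 16-byte chunks and a scalar tail;
hex_encode_neon_sketch and its exact structure are illustrative, not the
code in the attached patch):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

static void
hex_encode_neon_sketch(const uint8_t *src, size_t len, uint8_t *dst)
{
	/* table of hex digits indexed by nibble value 0..15 */
	const uint8x16_t lookup = vld1q_u8((const uint8_t *) "0123456789abcdef");
	size_t		i;

	for (i = 0; i + 16 <= len; i += 16)
	{
		uint8x16_t	bytes = vld1q_u8(src + i);
		uint8x16x2_t out;

		/* high nibble becomes the first digit, low nibble the second */
		out.val[0] = vqtbl1q_u8(lookup, vshrq_n_u8(bytes, 4));
		out.val[1] = vqtbl1q_u8(lookup, vandq_u8(bytes, vdupq_n_u8(0x0f)));

		/* interleaving store writes the digit pairs in order */
		vst2q_u8(dst + i * 2, out);
	}

	/* the remaining len - i bytes would fall through to a scalar loop */
}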
Here are the microbenchmark numbers.
hex_encode - m7g.4xlarge
Input | Head | Neon
-------+--------+--------
32 | 18.056 | 5.957
40 | 22.127 | 10.205
48 | 26.214 | 14.151
64 | 33.613 | 6.164
128 | 66.060 | 11.372
256 |130.225 | 18.543
512 |267.105 | 33.977
1024 |515.603 | 64.462
hex_decode - m7g.4xlarge
Input | Head | Neon
-------+--------+--------
32 | 26.669 | 9.462
40 | 36.320 | 19.347
48 | 45.971 | 19.099
64 | 58.468 | 17.648
128 |113.250 | 30.437
256 |218.743 | 56.824
512 |414.133 |107.212
1024 |828.493 |210.740
> Also, I'd strongly encourage you to get involved with others' patches on
> the mailing lists (e.g., reviewing, testing). Patch submissions are great,
> but this community depends on other types of participation, too. IME
> helping others with their patches also tends to incentivize others to help
> with yours.
Sure, we will try to test/review patches in areas where we have experience.
> On that note, I was hoping you could give us feedback on whether the
> improvement in PG18 made any difference at all in your real-world
> use-case, i.e. not just in a microbenchmark, but also including
> transmission of the hex-encoded values across the network to the
> client (that I assume must decode them again).
Yes, the improvement in v18 did help; see the attached perf graphs.
We used a Python script to send and receive binary data from Postgres.
For simple SELECT queries on a bytea column, hex_encode was taking
42% of the query execution time in v17; this was reduced to 33% in v18,
resulting in around an 18% improvement in overall query time.
The proposed patch further reduces the hex_encode function usage to
5.6%, another 25% improvement in total query time.
We observed similar improvements for INSERT queries on the bytea column.
hex_decode usage decreased from 15.5% to 5.5%, a 5-8% query-level
improvement depending on which storage type is used.
------
Chiranmoy
Attachments
On Thu, Sep 04, 2025 at 02:55:50PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> I see that there was some discussion about a Neon implementation upthread,
>> but I'm not sure we concluded anything.  For popcount, we first added a
>> Neon version before adding the SVE version, which required more complicated
>> configure/runtime checks.  Presumably Neon is available on more hardware
>> than SVE, so that could be a good place to start here, too.
>
> We have added the Neon versions of hex encode/decode.

Thanks.  I noticed that this stuff is simple enough that we can use
port/simd.h (with a few added functions).  This is especially nice because
it takes care of x86, too.  The performance gains look similar to what you
reported for v6:

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    13 |     6 |     54
    64 |    34 |     9 |     74
   256 |    93 |    25 |     73
  1024 |   281 |    78 |     72
  4096 |  1086 |   227 |     79
 16384 |  4382 |   927 |     79
 65536 | 17455 |  3608 |     79

x86
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    10 |     7 |     30
    64 |    29 |     9 |     69
   256 |    81 |    21 |     74
  1024 |   286 |    66 |     77
  4096 |  1106 |   253 |     77
 16384 |  4383 |   980 |     78
 65536 | 17491 |  3886 |     78

I've only modified hex_encode() for now, but I'm optimistic that we can do
something similar for hex_decode().

--
nathan
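For illustration, the "compare and add" step that bare SSE2 supports might
look like the following sketch (hypothetical function name; each input
byte is assumed to already hold a nibble value 0..15):

#include <emmintrin.h>		/* SSE2 */

/* Sketch: map 16 nibble values to ASCII hex digits without a lookup table. */
static inline __m128i
nibbles_to_hex_sse2(__m128i nib)
{
	/* nibbles above 9 need base 'a' - 10 instead of '0' */
	__m128i		gt9 = _mm_cmpgt_epi8(nib, _mm_set1_epi8(9));
	__m128i		adj = _mm_and_si128(gt9, _mm_set1_epi8('a' - '0' - 10));

	return _mm_add_epi8(_mm_add_epi8(nib, _mm_set1_epi8('0')), adj);
}

The signed comparison is safe here because nibble values fit in 0..15.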
Attachments
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> Thanks. I noticed that this stuff is simple enough that we can use
> port/simd.h (with a few added functions). This is especially nice because
> it takes care of x86, too. The performance gains look similar to what you
> reported for v6:
This looks good, much cleaner.
One possible improvement would be to use a vectorized table lookup instead of compare and add. I compared v6 and v7 Neon versions, and v6 is always faster.
I’m not sure if SSE2 has a table lookup similar to Neon.
arm - m7g.4xlarge
buf | v6-Neon| v7-Neon| % diff
-------+--------+--------+--------
64 | 6.16 | 8.57 | 28.07
128 | 11.37 | 15.77 | 27.87
256 | 18.54 | 30.28 | 38.77
512 | 33.98 | 62.15 | 45.33
1024 | 64.46 | 117.55 | 45.16
2048 | 124.28 | 254.86 | 51.24
4096 | 243.47 | 509.23 | 52.19
8192 | 487.34 | 953.81 | 48.91
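(On the SSE2 question: plain SSE2 indeed has no byte-table lookup, but
SSSE3's _mm_shuffle_epi8 behaves much like Neon's vqtbl1q_u8 -- presumably
the instruction behind the [0] reference later in the thread.  A
hypothetical sketch:)

#include <tmmintrin.h>		/* SSSE3, not SSE2 */

/* Sketch: SSSE3 analogue of the Neon nibble-to-hex-digit table lookup. */
static inline __m128i
nibbles_to_hex_ssse3(__m128i nib)
{
	const __m128i lookup = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
										 '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');

	/* each byte of nib (0..15) selects the matching hex digit */
	return _mm_shuffle_epi8(lookup, nib);
}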
-----
Chiranmoy
On Thu, Sep 11, 2025 at 10:43:56AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> One possible improvement would be to use a vectorized table lookup
> instead of compare and add. I compared v6 and v7 Neon versions, and v6 is
> always faster. I'm not sure if SSE2 has a table lookup similar to Neon.

I'm not finding a simple way to do that kind of table lookup in SSE2.

Part of the reason v6 performs better is because you've unrolled the loop
to process two vectors' worth of input data in each iteration.  This
trades performance with smaller inputs for gains with larger ones.  But
even if I do something similar for v7, v6 still wins most of the time.

My current philosophy with this stuff is to favor simplicity,
maintainability, portability, etc. over extracting the absolute maximum
amount of performance gain, so I think we should proceed with the simd.h
approach.  But I'm curious how others feel about this.

v8 is an attempt to fix the casting error on MSVC.

--
nathan
Attachments
Nathan Bossart <nathandbossart@gmail.com> writes:
> My current philosophy with this stuff is to favor simplicity,
> maintainability, portability, etc. over extracting the absolute maximum
> amount of performance gain, so I think we should proceed with the simd.h
> approach.  But I'm curious how others feel about this.

+1.  The maintainability aspect is critical over the long run.
Also, there's a very real danger of optimizing for the specific
hardware and test case you are working with, leading to actually
worse performance with future hardware.

			regards, tom lane
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
From:
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> My current philosophy with this stuff is to favor simplicity,
> maintainability, portability, etc. over extracting the absolute maximum
> amount of performance gain, so I think we should proceed with the simd.h
> approach. But I'm curious how others feel about this.
> +1. The maintainability aspect is critical over the long run.
> Also, there's a very real danger of optimizing for the specific
> hardware and test case you are working with, leading to actually
> worse performance with future hardware.
Using simd.h does make it easier to maintain.
Is there a plan to upgrade simd.h to use SSE4 or SSSE3 in the future?
Since SSE2 is much older, it lacks some of the more specialized intrinsics.
For example, vectorized table lookup can be implemented via [0], and
it’s available in SSSE3 and later x86 instruction sets.
-----
Chiranmoy
On Fri, Sep 12, 2025 at 06:49:01PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> Using simd.h does make it easier to maintain.  Is there a plan to upgrade
> simd.h to use SSE4 or SSSE3 in the future?  Since SSE2 is much older, it
> lacks some of the more specialized intrinsics.  For example, vectorized
> table lookup can be implemented via [0], and it's available in SSSE3 and
> later x86 instruction sets.

There have been a couple of discussions about the possibility of requiring
x86-64-v2 for Postgres, but I'm not aware of any serious efforts in that
area.

I've attached a new version of the patch with a simd.h version of
hex_decode().  Here are the numbers:

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    22 |    23 |     -5
    64 |    61 |    23 |     62
   256 |   158 |    47 |     70
  1024 |   542 |   122 |     77
  4096 |  2103 |   429 |     80
 16384 |  8548 |  1673 |     80
 65536 | 34663 |  6738 |     81

x86
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    13 |    14 |     -8
    64 |    42 |    15 |     64
   256 |   126 |    42 |     67
  1024 |   461 |   149 |     68
  4096 |  1802 |   576 |     68
 16384 |  7166 |  2280 |     68
 65536 | 28625 |  9108 |     68

A couple of notes:

* For hex_decode(), we just give up on the SIMD path and fall back on the
  scalar path as soon as we see anything outside [0-9A-Fa-f].  I suspect
  this might introduce a regression for inputs of ~32 to ~64 bytes that
  include whitespace (which must be skipped) or invalid characters, but I
  don't know whether those inputs are common or whether we care.

* The code makes some assumptions about endianness that might not be true
  everywhere, but I've yet to dig into this further.

--
nathan
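To make the fallback condition concrete, here is a scalar model of the
per-character mapping that the SIMD path has to validate (hex_nibble is an
illustrative helper, not a function in the patch):

#include <stdbool.h>
#include <stdint.h>

static inline bool
hex_nibble(uint8_t c, uint8_t *out)
{
	if (c >= '0' && c <= '9')
		*out = c - '0';
	else if ((c | 0x20) >= 'a' && (c | 0x20) <= 'f')
		*out = (c | 0x20) - 'a' + 10;
	else
		return false;			/* whitespace or invalid: SIMD path gives up */
	return true;
}

Anything that takes the "false" branch anywhere in a chunk sends the whole
input through the scalar path.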
Attachments
On Fri, Sep 12, 2025 at 04:30:21PM -0500, Nathan Bossart wrote:
> I've attached a new version of the patch with a simd.h version of
> hex_decode().  Here are the numbers:

I was able to improve the hex_decode() implementation a bit.

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    11 |    11 |      0
    64 |    38 |     7 |     82
   256 |   133 |    18 |     86
  1024 |   513 |    67 |     87
  4096 |  2037 |   271 |     87
 16384 |  8326 |  1103 |     87
 65536 | 34550 |  4475 |     87

x86
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |     8 |     9 |    -13
    64 |    38 |     7 |     82
   256 |   121 |    24 |     80
  1024 |   457 |    91 |     80
  4096 |  1797 |   356 |     80
 16384 |  7161 |  1411 |     80
 65536 | 28620 |  5632 |     80

--
nathan
Attachments
On Mon, Sep 22, 2025 at 03:05:44PM -0500, Nathan Bossart wrote:
> I was able to improve the hex_decode() implementation a bit.

I took a closer look at how hex_decode() performs with smaller inputs.
There are some small regressions, so I tried fixing them by adding the
following to the beginning of the function:

	if (likely(tail_idx == 0))
		return hex_decode_safe_scalar(src, len, dst, escontext);

This helped a little, but it mostly just slowed things down for larger
inputs on AArch64:

arm
   buf |  HEAD | patch |   fix
-------+-------+-------+-------
     2 |     4 |     6 |     4
     4 |     6 |     7 |     7
     8 |     8 |     8 |     8
    16 |    11 |    12 |    11
    32 |    18 |     5 |     6
    64 |    38 |     7 |     8
   256 |   134 |    18 |    24
  1024 |   514 |    67 |   100
  4096 |  2072 |   280 |   389
 16384 |  8409 |  1126 |  1537
 65536 | 34704 |  4498 |  6128

x86
   buf |  HEAD | patch |   fix
-------+-------+-------+-------
     2 |     2 |     2 |     2
     4 |     3 |     3 |     3
     8 |     4 |     4 |     4
    16 |     8 |     9 |     8
    32 |    23 |     5 |     5
    64 |    37 |     7 |     7
   256 |   122 |    24 |    24
  1024 |   457 |    91 |    92
  4096 |  1798 |   357 |   358
 16384 |  7161 |  1411 |  1416
 65536 | 28621 |  5630 |  5653

I didn't do this test for hex_encode(), but I'd expect it to follow a
similar pattern.  I'm tempted to suggest that these regressions are within
tolerable levels and to forge on with v10.

In any case, IMHO this patch is approaching committable quality, so I'd be
grateful for any feedback.

--
nathan
On Wed, Sep 24, 2025 at 2:02 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Sep 22, 2025 at 03:05:44PM -0500, Nathan Bossart wrote:
> > I was able to improve the hex_decode() implementation a bit.
>
> I took a closer look at how hex_decode() performs with smaller inputs.
> There are some small regressions, so I tried fixing them by adding the
> following to the beginning of the function:
>
>     if (likely(tail_idx == 0))
>         return hex_decode_safe_scalar(src, len, dst, escontext);
>
> This helped a little, but it mostly just slowed things down for larger
> inputs on AArch64:

> I didn't do this test for hex_encode(), but I'd expect it to follow a
> similar pattern.  I'm tempted to suggest that these regressions are
> within tolerable levels and to forge on with v10.

My first thought is, I'd hazard a guess that short byteas are much less
common than short strings.  My second thought is, the decode case is not
that critical.  From the end-to-end tests above, the speed of the decode
case had a relatively small global effect compared to the encode case
(perhaps because reads are cheaper than writes).

+   if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
+       break;

But if you really want to do something here, sprinkling "(un)likely"'s
here seems like solving the wrong problem (even if they make any
difference), since the early return is optimizing for exceptional
conditions.  In other places (cf. the UTF8 string verifier), we accumulate
errors, and only if we have them at the end do we restart from the
beginning with the slow error-checking path that can show the user the
offending input.

> In any case, IMHO this patch is approaching committable quality, so I'd
> be grateful for any feedback.

+vector8_sssub(const Vector8 v1, const Vector8 v2)

It's hard to parse "sss", so maybe we can borrow an Intel-ism and use
"iss" for the signed case?

+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_byte(const Vector32 v1);

Do we need declarations for these?  I recall that the existing
declarations are there for functions that are also used internally.

The nibble/byte things are rather specific.  Wouldn't it be more logical
to expose the already-generic shift operations and let the caller say by
how much?  Or does the compiler refuse because the intrinsic doesn't get
an immediate value?  Some are like that, but I'm not sure about these.  If
so, that's annoying and I wonder if there's a workaround.

+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+   Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
+   Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
+
+   return vector8_is_highbit_set(cmpe);

We take pains to avoid signed comparison on unsigned input for the "le"
case, and I don't see why it's okay here.

Do the regression tests have long enough cases that test exceptional
paths, like invalid bytes and embedded whitespace?  If not, we need some.

--
John Naylor
Amazon Web Services
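In outline, the accumulator pattern described above might look like this
(a sketch in the patch's simd.h vocabulary; chunk_invalid_mask() is a
hypothetical stand-in for the real per-chunk validity check, and the
decode/store step is elided):

/* Sketch only: accumulate error bits, check once after the loop. */
static uint64
hex_decode_outline(const char *src, size_t len, char *dst, Node *escontext)
{
	Vector8		error = vector8_broadcast(0);
	size_t		i;

	for (i = 0; i + sizeof(Vector8) <= len; i += sizeof(Vector8))
	{
		Vector8		chunk;

		vector8_load(&chunk, (const uint8 *) src + i);

		/* remember any bad byte, but keep decoding optimistically */
		error = vector8_or(error, chunk_invalid_mask(chunk));

		/* ... decode chunk and write to dst ... */
	}

	/* one check at the end; redo on the slow path to report the offender */
	if (vector8_is_highbit_set(error))
		return hex_decode_safe_scalar(src, len, dst, escontext);

	/* ... handle the remaining bytes with the scalar path ... */
}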
On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
> +   if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
> +       break;
>
> But if you really want to do something here, sprinkling "(un)likely"'s
> here seems like solving the wrong problem (even if they make any
> difference), since the early return is optimizing for exceptional
> conditions.  In other places (cf. the UTF8 string verifier), we
> accumulate errors, and only if we have them at the end do we restart
> from the beginning with the slow error-checking path that can show the
> user the offending input.

I switched to an accumulator approach in v11.

> +vector8_sssub(const Vector8 v1, const Vector8 v2)
>
> It's hard to parse "sss", so maybe we can borrow an Intel-ism and use
> "iss" for the signed case?

Done.

> +/* vector manipulation */
> +#ifndef USE_NO_SIMD
> +static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
> +static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
> +static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
> +static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
> +static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
> +static inline Vector32 vector32_shift_right_byte(const Vector32 v1);
>
> Do we need declarations for these?  I recall that the existing
> declarations are there for functions that are also used internally.

Removed.

> The nibble/byte things are rather specific.  Wouldn't it be more
> logical to expose the already-generic shift operations and let the
> caller say by how much?  Or does the compiler refuse because the
> intrinsic doesn't get an immediate value?  Some are like that, but I'm
> not sure about these.  If so, that's annoying and I wonder if there's a
> workaround.

Yeah, the compiler refuses unless the value is an integer literal.  I
thought of using a switch statement to cover all the values used in-tree,
but I didn't like that, either.

> +vector8_has_ge(const Vector8 v, const uint8 c)
> +{
> +#ifdef USE_SSE2
> +   Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
> +   Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
> +
> +   return vector8_is_highbit_set(cmpe);
>
> We take pains to avoid signed comparison on unsigned input for the
> "le" case, and I don't see why it's okay here.

_mm_max_epu8() does unsigned comparisons, I think...

> Do the regression tests have long enough cases that test exceptional
> paths, like invalid bytes and embedded whitespace?  If not, we need
> some.

Added.

I've also fixed builds on gcc/arm64, as discussed elsewhere [0].  Here are
the current numbers on my laptop:

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
     2 |     4 |     4 |      0
     4 |     6 |     6 |      0
     8 |     8 |     8 |      0
    16 |    11 |    12 |     -9
    32 |    18 |     5 |     72
    64 |    38 |     6 |     84
   256 |   134 |    17 |     87
  1024 |   513 |    63 |     88
  4096 |  2081 |   262 |     87
 16384 |  8524 |  1058 |     88
 65536 | 34731 |  4224 |     88

[0] https://postgr.es/m/aNQtN89VW8z-yo3B%40nathan

--
nathan
Attachments
On Thu, Sep 25, 2025 at 4:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
> > + if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
> > +     break;
> >
> > But if you really want to do something here, sprinkling "(un)likely"'s
> > here seems like solving the wrong problem (even if they make any
> > difference), since the early return is optimizing for exceptional
> > conditions.  In other places (cf. the UTF8 string verifier), we
> > accumulate errors, and only if we have them at the end do we restart
> > from the beginning with the slow error-checking path that can show the
> > user the offending input.
>
> I switched to an accumulator approach in v11.

Looks good to me.

+   if (unlikely(!success))
+       i = 0;

This is after the main loop exits, and the cold path is literally one
instruction, so the motivation is not apparent to me.

> > The nibble/byte things are rather specific.  Wouldn't it be more
> > logical to expose the already-generic shift operations and let the
> > caller say by how much?  Or does the compiler refuse because the
> > intrinsic doesn't get an immediate value?  Some are like that, but I'm
> > not sure about these.  If so, that's annoying and I wonder if there's a
> > workaround.
>
> Yeah, the compiler refuses unless the value is an integer literal.  I
> thought of using a switch statement to cover all the values used in-tree,
> but I didn't like that, either.

Neither option is great, but I mildly lean towards keeping it internal
with "switch" or whatever: By putting the burden of specifying shift
amounts on separately named functions we run a risk of combinatorial
explosion in function names.  If you feel otherwise, I'd at least use
actual numbers: "shift_left_nibble" is an awkward way to say "shift left
by 4 bits" anyway, and also after "byte" and "nibble" there are not many
good English words to convey the operand amount.  It's very possible that
needing other shift amounts will never come up, though.

> > +vector8_has_ge(const Vector8 v, const uint8 c)
> > +{
> > +#ifdef USE_SSE2
> > +   Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
> > +   Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
> > +
> > +   return vector8_is_highbit_set(cmpe);
> >
> > We take pains to avoid signed comparison on unsigned input for the
> > "le" case, and I don't see why it's okay here.
>
> _mm_max_epu8() does unsigned comparisons, I think...

Ah, I confused myself about what the LE case was avoiding, namely signed
LE, not signed equality on something else.  (Separately, now I'm wondering
if we can do the same for vector8_has_le since _mm_min_epu8 and vminvq_u8
both exist, and that would allow getting rid of )

> > Do the regression tests have long enough cases that test exceptional
> > paths, like invalid bytes and embedded whitespace?  If not, we need
> > some.
>
> Added.

Seems comprehensive enough at a glance.

Other comments:

+ * back together to form the final hex-encoded string.  It might be
+ * possible to squeeze out a little more gain by manually unrolling the
+ * loop, but for now we don't bother.

My position (and I think the community agrees) is that manual unrolling is
a rare desperation move that has to be justified, so we don't need to
mention its lack.

+ * Some compilers are picky about casts to the same underlying type, and others
+ * are picky about implicit conversions with vector types.  This function does
+ * the same thing as vector32_broadcast(), but it returns a Vector8 and is
+ * carefully crafted to avoid compiler indigestion.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_broadcast_u32(const uint32 c)
+{
+#ifdef USE_SSE2
+   return vector32_broadcast(c);
+#elif defined(USE_NEON)
+   return (Vector8) vector32_broadcast(c);
+#endif
+}

I'm ambivalent about this: The use case doesn't seem well motivated, since
I don't know why we'd actually need to both broadcast arbitrary integers
and also view the result as bytes.  Setting arbitrary bytes is what we're
really doing, and would more likely be useful in the future (attached,
only tested on x86, and I think part of the strangeness is the endianness
you mentioned above).  On the other hand, the Arm workaround results in
awful generated code compared to what you have here.  Since the "set"
should be hoisted out of the outer loop, and we already rely on this
pattern for vector8_highbit_mask anyway, it might be tolerable, and we can
reduce the pain with bitwise NOT.

+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */

v10 asserted this requirement -- that still seems like a good thing?

--
John Naylor
Amazon Web Services
Attachments
On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> +   if (unlikely(!success))
> +       i = 0;
>
> This is after the main loop exits, and the cold path is literally one
> instruction, so the motivation is not apparent to me.

Removed.  I was thinking about smaller inputs when I added this, but it
probably makes little difference.

>> Yeah, the compiler refuses unless the value is an integer literal.  I
>> thought of using a switch statement to cover all the values used in-tree,
>> but I didn't like that, either.
>
> Neither option is great, but I mildly lean towards keeping it internal
> with "switch" or whatever: By putting the burden of specifying shift
> amounts on separately named functions we run a risk of combinatorial
> explosion in function names.

Done.

> (Separately, now I'm wondering if we can do the same for
> vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> would allow getting rid of )

I think so.  I doubt there's any performance advantage, but it could be
nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.")  I'll
do this in the related patch in the "couple of small patches for simd.h"
thread.

> + * back together to form the final hex-encoded string.  It might be
> + * possible to squeeze out a little more gain by manually unrolling the
> + * loop, but for now we don't bother.
>
> My position (and I think the community agrees) is that manual
> unrolling is a rare desperation move that has to be justified, so we
> don't need to mention its lack.

Removed.

> + * Some compilers are picky about casts to the same underlying type, and others
> + * are picky about implicit conversions with vector types.  This function does
> + * the same thing as vector32_broadcast(), but it returns a Vector8 and is
> + * carefully crafted to avoid compiler indigestion.
> + */
> +#ifndef USE_NO_SIMD
> +static inline Vector8
> +vector8_broadcast_u32(const uint32 c)
> +{
> +#ifdef USE_SSE2
> +   return vector32_broadcast(c);
> +#elif defined(USE_NEON)
> +   return (Vector8) vector32_broadcast(c);
> +#endif
> +}
>
> I'm ambivalent about this: The use case doesn't seem well motivated,
> since I don't know why we'd actually need to both broadcast arbitrary
> integers and also view the result as bytes.  Setting arbitrary bytes is
> what we're really doing, and would more likely be useful in the
> future (attached, only tested on x86, and I think part of the
> strangeness is the endianness you mentioned above).  On the other hand,
> the Arm workaround results in awful generated code compared to what
> you have here.  Since the "set" should be hoisted out of the outer
> loop, and we already rely on this pattern for vector8_highbit_mask
> anyway, it might be tolerable, and we can reduce the pain with bitwise
> NOT.

I think I disagree on this one.  We're not broadcasting arbitrary bytes
for every vector element, we're broadcasting a pattern of bytes that
happens to be wider than the element size.  I would expect this to be a
relatively common use-case.  Furthermore, the "set" API is closely
tethered to the vector size, which is fine for SSE2/Neon but may not work
down the road (not to mention the USE_NO_SIMD path).  Also, the bitwise
NOT approach won't work because we need to use 0x0f000f00 and 0x000f000f
to avoid angering the assertion in vector8_pack_16(), as mentioned below.

> +/*
> + * Pack 16-bit elements in the given vectors into a single vector of 8-bit
> + * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
> + * this will produce different results on different architectures.
> + */
>
> v10 asserted this requirement -- that still seems like a good thing?

I had removed that because I worried the accumulator approach would cause
it to fail (it does), but looking again, that's easy enough to work
around.

--
nathan
Attachments
On Fri, Sep 26, 2025 at 1:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> > (Separately, now I'm wondering if we can do the same for
> > vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> > would allow getting rid of )
>
> I think so.  I doubt there's any performance advantage, but it could be
> nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
> (renamed to vector8_ussub() in the patch) after "getting rid of.")

Yes right, sorry.  And it seems good to do such cleanup first, since it
doesn't make sense to rename something that is about to be deleted.

> I think I disagree on this one.  We're not broadcasting arbitrary bytes
> for every vector element, we're broadcasting a pattern of bytes that
> happens to be wider than the element size.  I would expect this to be a
> relatively common use-case.

That's probably true.  I'm still worried that the hack for working around
compiler pickiness (while nice enough in its current form) might break at
some point and require awareness of compiler versions.

Hmm, for this case, we can sidestep the maintainability questions entirely
by instead using the new interleave functions to build the masks:

vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f))
vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero())

This generates code identical to v12 on Arm and is not bad on x86.  What
do you think of the attached?

While looking around again, it looks like the "msk" variable isn't a mask
like the name implies.  Not sure of a better name because I'm not sure
what it represents aside from a temp variable.

+#elif defined(USE_NEON)
+   switch (i)
+   {
+       case 4:
+           return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+       case 8:
+           return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+       default:
+           pg_unreachable();
+           return vector8_broadcast(0);
+   }

This is just a compiler hint, so if the input is not handled I think it
will return the wrong answer rather than alerting the developer, so we
probably want "Assert(false)" here.

Other than that, the pack/unpack functions could use some documentation
about which parameter is low/high.

--
John Naylor
Amazon Web Services
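As a scalar model of why the interleave trick produces the right masks
(demo code, not part of the patch):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint8_t		mask[16];

	/* models vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f)) */
	for (int i = 0; i < 8; i++)
	{
		mask[2 * i] = 0x00;		/* low halves come alternately from each input */
		mask[2 * i + 1] = 0x0f;
	}

	/* prints 000f repeated: 0x0f00 in each little-endian 16-bit lane */
	for (int i = 0; i < 16; i++)
		printf("%02x", mask[i]);
	putchar('\n');
	return 0;
}

Swapping the arguments yields the complementary 0x000f pattern.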
Attachments
On Mon, Sep 29, 2025 at 03:45:27PM +0700, John Naylor wrote:
> On Fri, Sep 26, 2025 at 1:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
>> > (Separately, now I'm wondering if we can do the same for
>> > vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
>> > would allow getting rid of )
>>
>> I think so.  I doubt there's any performance advantage, but it could be
>> nice for code cleanup.  (I'm assuming you meant to say vector8_ssub
>> (renamed to vector8_ussub() in the patch) after "getting rid of.")
>
> Yes right, sorry.  And it seems good to do such cleanup first, since it
> doesn't make sense to rename something that is about to be deleted.

Will do.  I'll plan on committing the other patch [0] soon.

> Hmm, for this case, we can sidestep the maintainability questions
> entirely by instead using the new interleave functions to build the
> masks:
>
> vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f))
> vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero())
>
> This generates code identical to v12 on Arm and is not bad on x86.
> What do you think of the attached?

WFM

> While looking around again, it looks like the "msk" variable isn't a
> mask like the name implies.  Not sure of a better name because I'm not
> sure what it represents aside from a temp variable.

Renamed to "tmp".

> +#elif defined(USE_NEON)
> +   switch (i)
> +   {
> +       case 4:
> +           return (Vector8) vshrq_n_u32((Vector32) v1, 4);
> +       case 8:
> +           return (Vector8) vshrq_n_u32((Vector32) v1, 8);
> +       default:
> +           pg_unreachable();
> +           return vector8_broadcast(0);
> +   }
>
> This is just a compiler hint, so if the input is not handled I think
> it will return the wrong answer rather than alerting the developer, so
> we probably want "Assert(false)" here.

Fixed.

> Other than that, the pack/unpack functions could use some
> documentation about which parameter is low/high.

Added.

[0] https://postgr.es/m/attachment/182185/v3-0001-Optimize-vector8_has_le-on-AArch64.patch

--
nathan