Обсуждение: Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

Поиск
Список
Период
Сортировка

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

От
John Naylor
Дата:
On Wed, May 7, 2025 at 8:04 PM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:
>
> Hi,
>
> Here I send a patch that adds a vectorized version of CRC32C for the
> IBM S390X hardware. I kindly ask for a review of the code and to pick
> it up in upstream postgresql.

Thanks!

> # Why this patch:
>
> We noticed that postgres running on an S390X will spend much longer in
> CRC32C as compared to other platform with optimized crc32c.

Right. That's also true of PowerPC -- perhaps that's a different team
than yours?

> Hendrik Brueckner would allow us to re-license his implementation of an
> optimized crc32c under postgres-license. This implementation is already
> used in gzip, zlib-ng and the linux kernel.

That's good news, although the copyright may need discussion. Please
see the three entries starting here:

https://wiki.postgresql.org/wiki/Developer_FAQ#Do_I_need_to_sign_a_copyright_assignment.3F

...and if you have questions ask us.

> # The optimized CRC32C:
>
> The IBM S390X platform has no dedicated CRC infrastructure. The
> algorithm works by using >>reduction constants to fold and process
> particular chunks of the input data stream in parallel.<< This makes
> grate use of the S390X vector units. Depending on the size of the input
> stream a speedup in the order of magnitude can be achieved(compared to
> sb8).

Okay, we are familiar with this technique as applied to AVX-512 and
Arm Crypto extensions.

> # Runtime checks:
>
> The runtime detection strategy follows the same approach as the ARM
> code. If the code is compiled with all needed flags enabled the runtime
> detection will not be compiled in. If the build system can enable all
> needed flags itself, it will also enable runtime detection.

This case is a bit different, since Arm can compute hardware CRC on
any input size. The fast path here is only guaranteed to be taken at
inputs of 79 bytes or bigger, with the 64-byte main loop and 16-byte
alignment. For larger inputs, an indirect call shouldn't matter, so to
keep it simple I'd recommend having only the runtime check.  And for
smaller inputs call sb8 directly -- this will avoid the indirect call
which in turn just jumps to sb8. This is important because the server
computes CRC on a 20-byte input for the WAL record header while
holding a lock. Something like:

#define COMP_CRC32C(crc, data, len)  \
  ((crc) = pg_comp_crc32c_dispatch((crc), (data), (len)))

static inline
pg_crc32c
pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
  /*
  if (len < VX_MIN_LEN + VX_ALIGN_MASK)
  {
    /*
     * For inputs small enough that we may not take the vector path,
     * just call sb8 directly.
     */
    return pg_comp_crc32c_sb8(crc, data, len);;
  }
  else
    /* Otherwise, use a runtime check for S390_VX instructions. */
    return pg_comp_crc32c(crc, data, len);
}

> # The glue code:
>
> I tried to follow the postgres coding conventions. I ran
> `./pg_bsd_indent -i4 -l79 -di12 -nfc1 -nlp -sac ...` as mentioned in
> src/tools/pg_bsd_indent/README. But for me this will absolutely not
> format code according to the postgres coding convention. Therefor I
> formatted everything by hand.

The entry point for that is a perl script, so you can invoke it on the
relevant directories like so:

src/tools/pgindent/pgindent src/port/ src/include/port

> I feared that simply writing a function pointer in a software spawning
> many threads and forks might cause issues. So i decided to use
> `__atomic_store_n` to set the CRC function pointer. Indeed I noticed
> that the other _choose.c files did not do this. However I am very
> confident that `__atomic_store_n` will always be available on a S390X.

If this behaves like Linux on other platforms, it'll be copy-on-write
when forking, and only the children will set this variable on first
call. Is there some documented reason we need special treatment here?

> As this is the first time I am writing m4/autotools, I'd kindly ask the
> reviewer for special care there :) . There may be dragons. But I have
> high hopes all is OK.

I haven't taken a close look, but here are a couple things I noticed
at a glance:

+# PGAC_S390X_BAD_VECTOR_INTRINSICS

Does the build still fall back to sb8 with a warning, or does it fail?
It'd simpler if the guard against certain clang versions went into the
choose function, so it can fall back to sb8.

-# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
+# all versions of pg_crc32c_armv8.o and pg_crc32c_s390x.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_s390x.o: CFLAGS+=$(CFLAGS_CRC)

It seems we have the same three *.o objects on S390X, for example:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=siskin&dt=2025-05-07%2019%3A08%3A40&stg=build

...so all should have the same additional CFLAGS.

+  prog = '''
+#include <vecintrin.h>
+int main(void) {
+    // test for vector extension
+    unsigned long long a __attribute__((vector_size(16))) = { 0 };
+    unsigned long long b __attribute__((vector_size(16))) = { 0 };
+    unsigned char c __attribute__((vector_size(16))) = { 0 };
+    c = vec_gfmsum_accum_128(a, b, c);
+    return c[0];
+}'''

We prefer that 'a' and 'b' are declared as global variables, just to
make it as realistic as possible, although it doesn't seem to make
much difference when I tried it on Compiler Explorer. (Same for
autoconf)

While playing around with that, the above doesn't compile on gcc 14,
but 13 and 15 work -- maybe a bug report is in order?

Also, if we can assume the existence of the other necessary vector
instructions if the above exists, that would be good to put in a
comment.

--
John Naylor
Amazon Web Services



John Naylor <johncnaylorls@gmail.com> writes:
> We prefer that 'a' and 'b' are declared as global variables, just to
> make it as realistic as possible, although it doesn't seem to make
> much difference when I tried it on Compiler Explorer. (Same for
> autoconf)

Yeah, see commit fdb5dd6331e305f797bb589747f056062c305f0b for
some recent precedent, and for explanation of why we think it's
worth worrying about.  It's not so much that "global" matters,
it's that we don't want the compiler to have the option to
fold the whole test to a constant, which it would be within
its rights to do as you have this.

            regards, tom lane



RE: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

От
Eduard Stefes
Дата:
On Thu, 2025-05-08 at 05:23 +0700, John Naylor wrote:
> Right. That's also true of PowerPC -- perhaps that's a different team
> than yours?

indeed that's another team. I can ping them but there is not much more
I can do.

> This case is a bit different, since Arm can compute hardware CRC on
> any input size. The fast path here is only guaranteed to be taken at
> inputs of 79 bytes or bigger, with the 64-byte main loop and 16-byte
> alignment. For larger inputs, an indirect call shouldn't matter, so
> to
> keep it simple I'd recommend having only the runtime check.  And for
> smaller inputs call sb8 directly -- this will avoid the indirect call
> which in turn just jumps to sb8. This is important because the server
> computes CRC on a 20-byte input for the WAL record header while
> holding a lock. 

I worked on the code and got it working on 16 byte chunks so the WAL
writes will directly benefit from this. I will try to calculate and add
the polynomials for the smaller chunks. Maybe we where wrong and we
will still see a speed improvement, also for the smaller pieces.
However that is something that I cannot tackle immediately. I'll come
back to this in the summer.  


> Something like:
> 
> #define COMP_CRC32C(crc, data, len)  \
>   ((crc) = pg_comp_crc32c_dispatch((crc), (data), (len)))
> 
> static inline
> pg_crc32c
> pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
> {
>   /*
>   if (len < VX_MIN_LEN + VX_ALIGN_MASK)
>   {
>     /*
>      * For inputs small enough that we may not take the vector path,
>      * just call sb8 directly.
>      */
>     return pg_comp_crc32c_sb8(crc, data, len);;
>   }
>   else
>     /* Otherwise, use a runtime check for S390_VX instructions. */
>     return pg_comp_crc32c(crc, data, len);
> }

Is static inline generally preferred here or should I do something
like:

#define COMP_CRC32C(crc, data, len)         \
  ((crc) = (len) < MAX_S390X_CRC ?           \
    pg_comp_crc32c_sb8((crc),(data),(len)) : \
    pg_comp_crc32c((crc), (data), (len)))

> If this behaves like Linux on other platforms, it'll be copy-on-write
> when forking, and only the children will set this variable on first
> call. Is there some documented reason we need special treatment here?

yes for forks i think there is no problem. But I don't know what will
happen when two cpus will try to read/set the value. The value is
definitely not updated across both CPU caches. So my guess would be
that there is a high chance that the _choose code is called multiple
times(IF there really are multiple threads trying to calculate a crc at
the same time AND the pointer is not already initialized).

Indeed, because it makes no difference if the code is executed multiple
times its totally fine to not use atomic store here. So I'll remove the
code and the checks from the build system again.

> Does the build still fall back to sb8 with a warning, or does it
> fail?
> It'd simpler if the guard against certain clang versions went into
> the
> choose function, so it can fall back to sb8.

My current implementation will silently fall back to sb8 and not
compile any of the s390x code. 

Other possible strategies are:

- print a warning and fall back to sb0
- fail the compile and inform the user to switch compiler or to disable
crc32vx in the build system
- move all the checks into the .c file and to this with with macros

indeed for zlib-ng we went the way to put all the checks into the code
and fail the build if compiled with a known broken compiler. 

My arguments to not do the same in postgres are:

- postgres uses sb8 on s390x already so by NOT changing it to another
algorithm we do not change the expected behavior.
- imho, this is something that the build system should check and take
care of, not the code that's getting compiled.

What is the preferred postgres way here?

> While playing around with that, the above doesn't compile on gcc 14,
> but 13 and 15 work -- maybe a bug report is in order?

Thy for reporting I'll look into that!


Many thanks for all other comments. I'll work on the code to
incorporate all remarks.


-- 
Eduard Stefes <eduard.stefes@ibm.com>

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

От
John Naylor
Дата:
On Sat, May 31, 2025 at 2:41 AM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:
>
> On Thu, 2025-05-08 at 05:23 +0700, John Naylor wrote:
>
> > This case is a bit different, since Arm can compute hardware CRC on
> > any input size. The fast path here is only guaranteed to be taken at
> > inputs of 79 bytes or bigger, with the 64-byte main loop and 16-byte
> > alignment. For larger inputs, an indirect call shouldn't matter, so
> > to
> > keep it simple I'd recommend having only the runtime check.  And for
> > smaller inputs call sb8 directly -- this will avoid the indirect call
> > which in turn just jumps to sb8. This is important because the server
> > computes CRC on a 20-byte input for the WAL record header while
> > holding a lock.
>
> I worked on the code and got it working on 16 byte chunks so the WAL
> writes will directly benefit from this. I will try to calculate and add
> the polynomials for the smaller chunks. Maybe we where wrong and we
> will still see a speed improvement, also for the smaller pieces.
> However that is something that I cannot tackle immediately. I'll come
> back to this in the summer.

The measurements from your previous message used 16-byte chunks -- did
that have the necessary constants to get the correct answer?

I'm not sure what you're in doubt about, but I guess it will be clear
with your next patch.

> > Something like:
> >
> > #define COMP_CRC32C(crc, data, len)  \
> >   ((crc) = pg_comp_crc32c_dispatch((crc), (data), (len)))
> >
> > static inline
> > pg_crc32c
> > pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
> > {
> >   /*
> >   if (len < VX_MIN_LEN + VX_ALIGN_MASK)
> >   {
> >     /*
> >      * For inputs small enough that we may not take the vector path,
> >      * just call sb8 directly.
> >      */
> >     return pg_comp_crc32c_sb8(crc, data, len);;
> >   }
> >   else
> >     /* Otherwise, use a runtime check for S390_VX instructions. */
> >     return pg_comp_crc32c(crc, data, len);
> > }
>
> Is static inline generally preferred here or should I do something
> like:
>
> #define COMP_CRC32C(crc, data, len)          \
>   ((crc) = (len) < MAX_S390X_CRC ?           \
>     pg_comp_crc32c_sb8((crc),(data),(len)) : \
>     pg_comp_crc32c((crc), (data), (len)))

That macro is actually better -- the static inline was used for x86
because it has looping which is not the case for S390X.

> > Does the build still fall back to sb8 with a warning, or does it
> > fail?
> > It'd simpler if the guard against certain clang versions went into
> > the
> > choose function, so it can fall back to sb8.
>
> My current implementation will silently fall back to sb8 and not
> compile any of the s390x code.

That seems like a good design.

> Other possible strategies are:
>
> - print a warning and fall back to sb0
> - fail the compile and inform the user to switch compiler or to disable
> crc32vx in the build system
> - move all the checks into the .c file and to this with with macros
>
> indeed for zlib-ng we went the way to put all the checks into the code
> and fail the build if compiled with a known broken compiler.
>
> My arguments to not do the same in postgres are:
>
> - postgres uses sb8 on s390x already so by NOT changing it to another
> algorithm we do not change the expected behavior.
> - imho, this is something that the build system should check and take
> care of, not the code that's getting compiled.

I'm inclined to agree, but I wonder if these guards should go inside
the link check. I.e. the separate config machinery for "have bad
compiler" seems like unnecessary clutter.

--
John Naylor
Amazon Web Services