Re: [POC] verifying UTF-8 using SIMD instructions
From | Thomas Munro |
---|---|
Subject | Re: [POC] verifying UTF-8 using SIMD instructions |
Date | |
Msg-id | CA+hUKGJjyXvS6W05kRVpH6Kng50=uOGxyiyjgPKm707JxQYHCg@mail.gmail.com |
In response to | Re: [POC] verifying UTF-8 using SIMD instructions (John Naylor <john.naylor@enterprisedb.com>) |
Responses | Re: [POC] verifying UTF-8 using SIMD instructions |
List | pgsql-hackers |
On Thu, Jul 22, 2021 at 6:16 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Neat! It's good to make it more architecture-agnostic, and I'm sure we can use quite a bit of this.

One question is whether this "one size fits all" approach will be
extensible to wider SIMD.

> to_bool(const pg_u8x16_t v)
> {
> +#if defined(USE_NEON)
> +    return vmaxvq_u32((uint32x4_t) v) != 0;
>
> --> return vmaxvq_u8(*this) != 0;

I chose that lane width because I saw an unsubstantiated claim
somewhere that it might be faster, but I have no idea if it matters.
The u8 code looks more natural anyway. Changed.

> vzero()
> {
> +#if defined(USE_NEON)
> +    return vmovq_n_u8(0);
>
> --> return vdupq_n_u8(0); // or equivalently, splat(0)

I guess it doesn't make a difference which builtin you use here, but I
was influenced by the ARM manual which says the vdupq form is
generated for immediate values.

> is_highbit_set(const pg_u8x16_t v)
> {
> +#if defined(USE_NEON)
> +    return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));
>
> --> return vmaxq_u8(v) > 0x7F

Ah, of course. Much nicer!

> +#if defined(USE_NEON)
> +static pg_attribute_always_inline pg_u8x16_t
> +vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
> +     uint8 v4, uint8 v5, uint8 v6, uint8 v7,
> +     uint8 v8, uint8 v9, uint8 v10, uint8 v11,
> +     uint8 v12, uint8 v13, uint8 v14, uint8 v15)
> +{
> +    uint8 pg_attribute_aligned(16) values[16] = {
> +        v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
> +    };
> +    return vld1q_u8(values);
> +}
>
> --> They have this strange beast instead:
>
>   // Doing a load like so end ups generating worse code.
>   // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
>   //                      x9, x10,x11,x12,x13,x14,x15,x16};
>   // return vld1q_u8(array);
>   uint8x16_t x{};
>   // incredibly, Visual Studio does not allow x[0] = x1
>   x = vsetq_lane_u8(x1, x, 0);
>   x = vsetq_lane_u8(x2, x, 1);
>   x = vsetq_lane_u8(x3, x, 2);
>   ...
>   x = vsetq_lane_u8(x15, x, 14);
>   x = vsetq_lane_u8(x16, x, 15);
>   return x;
>
> Since you aligned the array, that might not have the problem alluded to above, and it looks nicer.

Strange indeed. We should probably poke around in the assembler and
see... it might be that MSVC doesn't like it, and I was just
cargo-culting the alignment. I don't expect the generated code to
really "load" anything of course; it should ideally be some kind of
immediate mov...

FWIW here are some performance results from my humble RPI4:

master:

 chinese | mixed | ascii
---------+-------+-------
    4172 |  2763 |  1823
(1 row)

Your v15 patch:

 chinese | mixed | ascii
---------+-------+-------
    2267 |  1248 |   399
(1 row)

Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:

 chinese | mixed | ascii
---------+-------+-------
     909 |   620 |   318
(1 row)

It's so good I wonder if it's producing incorrect results :-)

I also tried to do a quick and dirty AltiVec patch to see if it could
fit into the same code "shape", with less immediate success: it works
out slower than the fallback code on the POWER7 machine I scrounged an
account on. I'm not sure what's wrong there, but maybe it's a useful
start. (I'm probably confused about endianness, or about the encoding
of boolean vectors, which may be different (is true 0x01 or 0xff, and
does it matter?), or about something else, and it's falling back on
errors all the time?)
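
For readers following along, here is a minimal sketch of the three helpers discussed above (vzero, to_bool, is_highbit_set). It is not the patch itself. The names pg_u8x16_t, vzero, to_bool and is_highbit_set come from the quoted diff; vload, vmax_across, SKETCH_USE_NEON, example_usage and the scalar fallback are assumptions added so that the sketch also builds on non-ARM machines. It uses the across-vector reduction vmaxvq_u8 (AArch64 only), because vmaxq_u8 is the two-operand lane-wise maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_USE_NEON 1		/* hypothetical macro, not the patch's USE_NEON */
typedef uint8x16_t pg_u8x16_t;
#else
/* Scalar stand-in (an assumption of this sketch, not part of the patch). */
typedef struct
{
	uint8_t		b[16];
} pg_u8x16_t;
#endif

/* Load 16 bytes from memory into a vector (hypothetical helper). */
static inline pg_u8x16_t
vload(const uint8_t *p)
{
#ifdef SKETCH_USE_NEON
	return vld1q_u8(p);
#else
	pg_u8x16_t	v;

	memcpy(v.b, p, sizeof(v.b));
	return v;
#endif
}

/* Vector with every lane set to zero. */
static inline pg_u8x16_t
vzero(void)
{
#ifdef SKETCH_USE_NEON
	return vdupq_n_u8(0);
#else
	pg_u8x16_t	v;

	memset(v.b, 0, sizeof(v.b));
	return v;
#endif
}

/* Maximum over all 16 byte lanes (hypothetical helper). */
static inline uint8_t
vmax_across(pg_u8x16_t v)
{
#ifdef SKETCH_USE_NEON
	return vmaxvq_u8(v);
#else
	uint8_t		m = 0;

	for (int i = 0; i < 16; i++)
		if (v.b[i] > m)
			m = v.b[i];
	return m;
#endif
}

/* True if any lane is non-zero. */
static inline bool
to_bool(pg_u8x16_t v)
{
	return vmax_across(v) != 0;
}

/* True if any byte has its top bit set, i.e. the chunk is not pure ASCII. */
static inline bool
is_highbit_set(pg_u8x16_t v)
{
	return vmax_across(v) > 0x7F;
}

/* Small usage example. */
void
example_usage(void)
{
	uint8_t		buf[16];

	memset(buf, 'a', sizeof(buf));
	assert(!is_highbit_set(vload(buf)));	/* all ASCII */

	buf[7] = 0xE4;				/* lead byte of a 3-byte UTF-8 sequence */
	assert(is_highbit_set(vload(buf)));

	assert(!to_bool(vzero()));
	assert(to_bool(vload(buf)));
}
```

One property worth noting in light of the AltiVec question: a reduction that only tests for "non-zero" gives the same answer whether a true lane is encoded as 0x01 or 0xff, so to_bool written this way does not depend on that encoding. Code that uses a comparison result as a bit mask would depend on it.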