Re: [POC] verifying UTF-8 using SIMD instructions
От | Heikki Linnakangas |
---|---|
Тема | Re: [POC] verifying UTF-8 using SIMD instructions |
Дата | |
Msg-id | b92ed011-02b6-1403-741d-ae3f017001b5@iki.fi обсуждение исходный текст |
Ответ на | Re: [POC] verifying UTF-8 using SIMD instructions (John Naylor <john.naylor@enterprisedb.com>) |
Ответы |
Re: [POC] verifying UTF-8 using SIMD instructions
Re: [POC] verifying UTF-8 using SIMD instructions Re: [POC] verifying UTF-8 using SIMD instructions |
Список | pgsql-hackers |
On 07/02/2021 22:24, John Naylor wrote: > Here is a more polished version of the function pointer approach, now > adapted to all multibyte encodings. Using the not-yet-committed tests > from [1], I found a thinko bug that resulted in the test for nul bytes > to not only be wrong, but probably also elided by the compiler. Doing it > correctly is noticeably slower on pure ascii, but still several times > faster than before, so the conclusions haven't changed any. I'll run > full measurements later this week, but I'll share the patch now for review. As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's algorithm from the simdjson library [1], see attached patch. I microbenchmarked it using the the same test I used before [2]. These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1 20210110" unpatched master: postgres=# \i mbverifystr-speed.sql CREATE FUNCTION mixed | ascii -------+------- 728 | 393 (1 row) v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch: mixed | ascii -------+------- 759 | 98 (1 row) simdjson-utf8-hack.patch: mixed | ascii -------+------- 53 | 31 (1 row) So clearly that algorithm is fast. Not sure if it has a high startup cost, or large code size, or other tradeoffs that we don't want. At least it depends on SIMD instructions, so it requires more code for the architecture-specific implementations and autoconf logic and all that. Nevertheless I think it deserves a closer look, I'm a bit reluctant to put in half-way measures, when there's a clearly superior algorithm out there. I also tested the fallback implementation from the simdjson library (included in the patch, if you uncomment it in simdjson-glue.c): mixed | ascii -------+------- 447 | 46 (1 row) I think we should at least try to adopt that. At a high level, it looks pretty similar your patch: you load the data 8 bytes at a time, check if there are all ASCII. If there are any non-ASCII chars, you check the bytes one by one, otherwise you load the next 8 bytes. Your patch should be able to achieve the same performance, if done right. I don't think the simdjson code forbids \0 bytes, so that will add a few cycles, but still. [1] https://github.com/simdjson/simdjson [2] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi - Heikki PS. Your patch as it stands isn't safe on systems with strict alignment, the string passed to the verify function isn't guaranteed to be 8 bytes aligned. Use memcpy to fetch the next 8-byte chunk to fix.
Вложения
В списке pgsql-hackers по дате отправления: