Discussion: Improve the performance of Unicode Normalization Forms.
Hi, hackers!

As promised, I continue to improve/speed up Unicode in Postgres.
Last time, we improved the lower(), upper(), and casefold() functions. [1]
Now it's time for Unicode Normalization Forms, specifically the
normalize() function.

The attachment contains three patches:

1. Move the functions for building a range index into a separate Perl module.
2-small. The mapping tables are smaller, but this increases the number of
branches needed to find the desired value.
2-big. The tables are larger, but the number of branches needed to find the
desired value is small.

In other words, we choose between two patches, either small or big.

What did we manage to achieve?

1. The uint32 field that stored the Unicode codepoint was removed from the
main structure/table. It was only needed for perfect hashing.
2. Thanks to the first point, we were able to eliminate duplicates and reduce
the main table from 6843 to 3943 entries.
3. We also removed duplicates from the table of decomposition codepoints,
shrinking it from 5138 to 4304 uint32 entries.
4. These changes increase the speed by up to 40% in some cases (benchmarks
below).
5. The server part "lost weight" in the binary, but the frontend "gained
weight" a little.

I read the old commits, which say that the size of the frontend is very
important and that speed is not important (speed matters on the server).
I'm not quite sure what to do if this is really the case. Perhaps we should
leave the slow version for the frontend.

Benchmarks.

How was it tested?
Four files were created, one for each normalization form: NFC, NFD, NFKC,
and NFKD. The files were sent via pgbench. They contain all code points that
need to be normalized. Unfortunately, the patches are already quite large,
but if necessary, I can send these files in a separate email or upload them
somewhere.

Legend.
Patch (big table): a larger mapping table, but with a small number of
branches for the lookup.
Patch (small table): smaller tables, but with a larger number of branches
for the lookup.
Without: the original Postgres without patches.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

NFC:
Patch (big table):   tps = 5057.624509
Patch (small table): tps = 4727.158048
Without:             tps = 3268.654867

NFD:
Patch (big table):   tps = 1145.730247
Patch (small table): tps = 1073.084633
Without:             tps = 689.627239

NFKC:
Patch (big table):   tps = 4821.700702
Patch (small table): tps = 4662.640629
Without:             tps = 3450.254581

NFKD:
Patch (big table):   tps = 1142.644341
Patch (small table): tps = 951.405893
Without:             tps = 636.161496

How the size of object files and libraries has changed (in bytes):

Patch (big table):
unicode_norm.o:      138896
unicode_norm_srv.o:  191336
libpq.a:             364782
libpq.so.5.18:       480944
libpgcommon.a:       584432
libpgcommon_shlib.a: 568724

Patch (small table):
unicode_norm.o:      86416
unicode_norm_srv.o:  138824
libpq.a:             364782
libpq.so.5.18:       423600
libpgcommon.a:       531952
libpgcommon_shlib.a: 516244

Without:
unicode_norm.o:      79488
unicode_norm_srv.o:  165504
libpq.a:             364782
libpq.so.5.18:       419288
libpgcommon.a:       525024
libpgcommon_shlib.a: 509316

[1] https://www.postgresql.org/message-id/flat/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com

--
Best regards,
Alexander Borisov
Attachments
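[Editorial note: for readers not following the patches themselves, here is a
toy illustration of the two-step "range index" lookup being discussed. The
real tables and branches are generated by Perl scripts from UnicodeData.txt;
the handful of entries and the struct layout below are invented for the
example. It also shows how dropping the codepoint field lets duplicate rows,
such as combining marks that only carry ccc = 230, collapse into one dense
entry.]

    #include <stdint.h>
    #include <stdio.h>

    typedef struct
    {
        uint8_t   comb_class;     /* canonical combining class */
        uint32_t  decomp[2];      /* canonical decomposition, zero-padded */
    } norm_entry;

    /* Dense table: entry 0 is reserved as "no data"; duplicates are stored
     * once (all of U+0300..U+0304 share entry 1). */
    static const norm_entry norm_table[] = {
        {0,   {0, 0}},            /* 0: no entry */
        {230, {0, 0}},            /* 1: combining marks with ccc = 230 */
        {0,   {0x0041, 0x0300}},  /* 2: U+00C0 -> A + combining grave */
        {0,   {0x0041, 0x0301}},  /* 3: U+00C1 -> A + combining acute */
    };

    /* Sparse index: one small slot per codepoint inside the covered ranges. */
    static const uint16_t norm_index[] = {
        2, 3,                     /* U+00C0..U+00C1 */
        1, 1, 1, 1, 1,            /* U+0300..U+0304 */
    };

    static const norm_entry *
    norm_lookup(uint32_t cp)
    {
        uint16_t idx = 0;

        /* Generated branches: each covered range maps to an offset in
         * norm_index. */
        if (cp >= 0x00C0 && cp <= 0x00C1)
            idx = norm_index[cp - 0x00C0];
        else if (cp >= 0x0300 && cp <= 0x0304)
            idx = norm_index[cp - 0x0300 + 2];

        return idx ? &norm_table[idx] : NULL;
    }

    int
    main(void)
    {
        const norm_entry *e = norm_lookup(0x00C0);

        if (e != NULL)
            printf("U+00C0 decomposes to U+%04X U+%04X\n",
                   (unsigned) e->decomp[0], (unsigned) e->decomp[1]);
        return 0;
    }

The "small" versus "big" patch trade-off is essentially about how many such
branches the generated lookup needs versus how large norm_index is allowed
to grow.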
On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:

> 5. The server part "lost weight" in the binary, but the frontend
> "gained weight" a little.
>
> I read the old commits, which say that the size of the frontend is very
> important and that speed is not important (speed matters on the server).
> I'm not quite sure what to do if this is really the case. Perhaps we
> should leave the slow version for the frontend.

In the "small" patch, the frontend files got a few kB bigger, but the
backend got quite a bit smaller. If we decided to go with this patch,
I'd say it's preferable to do it in a way that keeps both paths the
same.

> How was it tested?
> Four files were created, one for each normalization form: NFC, NFD,
> NFKC, and NFKD. The files were sent via pgbench. They contain all code
> points that need to be normalized.
> Unfortunately, the patches are already quite large, but if necessary,
> I can send these files in a separate email or upload them somewhere.

What kind of workload do they present?
Did you consider running the same tests from the thread that led to
the current implementation?

--
John Naylor
Amazon Web Services
11.06.2025 10:13, John Naylor wrote:
> On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
>> 5. The server part "lost weight" in the binary, but the frontend
>> "gained weight" a little.
>>
>> I read the old commits, which say that the size of the frontend is very
>> important and that speed is not important (speed matters on the server).
>> I'm not quite sure what to do if this is really the case. Perhaps we
>> should leave the slow version for the frontend.
>
> In the "small" patch, the frontend files got a few kB bigger, but the
> backend got quite a bit smaller. If we decided to go with this patch,
> I'd say it's preferable to do it in a way that keeps both paths the
> same.

Okay, then I'll leave the frontend unchanged so that the size remains
the same. The changes will only affect the backend.

>> How was it tested?
>> Four files were created, one for each normalization form: NFC, NFD,
>> NFKC, and NFKD. The files were sent via pgbench. They contain all code
>> points that need to be normalized.
>> Unfortunately, the patches are already quite large, but if necessary,
>> I can send these files in a separate email or upload them somewhere.
>
> What kind of workload do they present?
> Did you consider running the same tests from the thread that led to
> the current implementation?

I found the performance tests in this discussion:
https://www.postgresql.org/message-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com

Below are the performance test results.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

1. Normalize, decomp only

select count(normalize(t, NFD)) from (
  select md5(i::text) as t from
  generate_series(1,100000) as i
) s;

Patch (big table):   279.858 ms
Patch (small table): 282.925 ms
Without:             444.118 ms

2.

select count(normalize(t, NFD)) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
  generate_series(1,100000) as i
) s;

Patch (big table):   219.858 ms
Patch (small table): 247.893 ms
Without:             376.906 ms

3. Normalize, decomp+recomp

select count(normalize(t, NFC)) from (
  select md5(i::text) as t from
  generate_series(1,1000) as i
) s;

Patch (big table):   7.553 ms
Patch (small table): 7.876 ms
Without:             13.177 ms

4.

select count(normalize(t, NFC)) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
  generate_series(1,1000) as i
) s;

Patch (big table):   5.765 ms
Patch (small table): 6.782 ms
Without:             10.800 ms

5. Quick check has not changed because these patches do not affect it:

-- all chars are quickcheck YES
select count(*) from (
  select md5(i::text) as t from
  generate_series(1,100000) as i
) s;

Patch (big table):   29.477 ms
Patch (small table): 29.436 ms
Without:             29.378 ms

From these tests, we see a 2x speedup in some cases.

--
Best regards,
Alexander Borisov
On Wed, Jun 11, 2025 at 7:27 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
>
> 11.06.2025 10:13, John Naylor wrote:
> > In the "small" patch, the frontend files got a few kB bigger, but the
> > backend got quite a bit smaller. If we decided to go with this patch,
> > I'd say it's preferable to do it in a way that keeps both paths the
> > same.
>
> Okay, then I'll leave the frontend unchanged so that the size remains
> the same. The changes will only affect the backend.

Sorry, by "both paths" I meant make the frontend and backend the same,
because it's good for testing. In the "small table" patch, libpq etc.
increase by about 1%, which is negligible. unicode_norm.o is only
bigger by 7kB -- that seems okay to me, especially considering
unicode_norm_srv.o is smaller by 27kB.

> From these tests, we see a 2x speedup in some cases.

Nice!

--
John Naylor
Amazon Web Services
On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:
> As promised, I continue to improve/speed up Unicode in Postgres.
> Last time, we improved the lower(), upper(), and casefold()
> functions. [1]
> Now it's time for Unicode Normalization Forms, specifically
> the normalize() function.

Did you compare against other implementations, such as ICU's
normalization functions? There's also a rust crate here:

https://github.com/unicode-rs/unicode-normalization

that might have been optimized.

In addition to the lookups themselves, there are other opportunities
for optimization as well, such as:

 * reducing the need for palloc and extra buffers, perhaps by using
   buffers on the stack for small strings

 * operating more directly on UTF-8 data rather than decoding and
   re-encoding the entire string

Regards,
	Jeff Davis
19.06.2025 20:41, Jeff Davis wrote:
> On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:
>> As promised, I continue to improve/speed up Unicode in Postgres.
>> Last time, we improved the lower(), upper(), and casefold()
>> functions. [1]
>> Now it's time for Unicode Normalization Forms, specifically
>> the normalize() function.
>
> Did you compare against other implementations, such as ICU's
> normalization functions? There's also a rust crate here:
>
> https://github.com/unicode-rs/unicode-normalization
>
> that might have been optimized.

I don't quite see how this compares to the implementation in Rust. In
the link provided, they use a perfect hash, which I got rid of to get
a 2x boost.
As for the ICU implementation in C++, I have always considered it
slow, at least when used from C code.
I may well run benchmarks and compare the performance of the Postgres
approach against ICU, but that is beyond the scope of the patches
under discussion.

I want to emphasize that the patches I sent don't change the
normalization code/logic. They change how the right codepoints are
found in the tables, which is what gives us the performance boost.

> In addition to the lookups themselves, there are other opportunities
> for optimization as well, such as:
>
> * reducing the need for palloc and extra buffers, perhaps by using
>   buffers on the stack for small strings
>
> * operating more directly on UTF-8 data rather than decoding and
>   re-encoding the entire string

I absolutely agree with you: the normalization code is very well
written but far from optimized. I didn't send changes to the
normalization code itself, to avoid lumping everything together and
to make the review easier.
In keeping with my idea of optimizing the normalization forms, I
planned to discuss the C code optimizations in the next iteration of
"Improve performance...".

--
Regards,
Alexander Borisov
On Thu, Jun 19, 2025 at 10:41:57AM -0700, Jeff Davis wrote:
> In addition to the lookups themselves, there are other opportunities
> for optimization as well, such as:
>
> * reducing the need for palloc and extra buffers, perhaps by using
>   buffers on the stack for small strings
>
> * operating more directly on UTF-8 data rather than decoding and
>   re-encoding the entire string

Not sure how relevant it is here, but in ZFS in the 00s we implemented
form-insensitive, form-preserving functionality with an interesting
optimization that greatly reduced allocations and which had a decent
fast path for ASCII-mostly strings. It works by doing normalization
only in a) string comparison functions, b) string hashing functions
(because in ZFS directories are hash tables). For b-trees one has to
normalize the whole string, so perhaps this scheme is just not that
relevant to databases that use b-trees, but then depending on how the
b-tree key is smeared over internal nodes it might be.

The ASCII-mostly fast path was that when comparing strings you want a
"cursor" through each string, and whenever the current and next bytes
are both ASCII you compare the current one the fast way (as ASCII),
and you only take the slow path of having to check if the current
_character_ (as opposed to _byte_) requires normalization when either
the current or next bytes are not ASCII. In the slow path you only
normalize the _current character_, so you only need enough buffer
space for that. In principle a Unicode character could consist of an
unbounded number of codepoints (zalgo), I think, but in practice it
will be no more than half a dozen or so, so the buffer can be small
and stack-allocated. This same concept applies to string hashing: you
walk a cursor over the string and hash each ASCII _character_ and
normalize all non-ASCII characters, all one character at a time.

The really nice thing about form-insensitive/form-preserving
functionality is that it is form-preserving rather than normalizing on
create and lookup, and that makes the fast path described above
feasible. Whereas if you normalize on create and lookup you have to
heap-allocate enough space for each normalized string. The other nice
thing is that f-i/f-p behavior is a lot less surprising to users --
the input methods they use don't matter.

What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
something very close to NFD) but input methods (even on OS X)
invariably produce NFC (well, something close to it), at least for
Latin scripts. This means that on OS X if you use filenames from
directory listings those will be NFD while user inputs will be NFC,
and so you can't just memcmp()/strcmp() those -- you have to normalize
_or_ use form-insensitive string comparison, but nothing did that 20
years ago. Thus doing the form-insensitivity in the filesystem seemed
best, and if you do that you can be form-preserving to enable the
optimization described above.

My apologies if the above is not relevant to PG and thus noise,

Nico
--
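[Editorial note: a rough C sketch of the cursor-based, ASCII-mostly fast path
Nico describes. normalize_one_char() is only a stub that copies one UTF-8
sequence unchanged; it stands in for real single-character normalization, so
this illustrates the control flow, not actual Unicode behaviour.]

    #include <stddef.h>
    #include <string.h>

    #define MAX_NORM_CHAR 32    /* enough for one character's expansion in practice */

    static size_t
    normalize_one_char(const unsigned char *p, size_t len,
                       unsigned char *buf, size_t *buflen)
    {
        size_t clen = 1;

        while (clen < len && (p[clen] & 0xC0) == 0x80)  /* skip UTF-8 continuation bytes */
            clen++;
        memcpy(buf, p, clen);
        *buflen = clen;
        return clen;            /* bytes consumed from the input */
    }

    int
    norm_insensitive_cmp(const unsigned char *a, size_t alen,
                         const unsigned char *b, size_t blen)
    {
        size_t i = 0, j = 0;

        while (i < alen && j < blen)
        {
            /* Fast path: current and next bytes on both sides are plain
             * ASCII, so no following combining mark can affect the current
             * character. */
            if (a[i] < 0x80 && b[j] < 0x80 &&
                (i + 1 >= alen || a[i + 1] < 0x80) &&
                (j + 1 >= blen || b[j + 1] < 0x80))
            {
                if (a[i] != b[j])
                    return a[i] < b[j] ? -1 : 1;
                i++;
                j++;
                continue;
            }

            /* Slow path: normalize just the current character on each side
             * into small stack buffers and compare those. */
            {
                unsigned char na_buf[MAX_NORM_CHAR], nb_buf[MAX_NORM_CHAR];
                size_t na, nb;
                int    c;

                i += normalize_one_char(a + i, alen - i, na_buf, &na);
                j += normalize_one_char(b + j, blen - j, nb_buf, &nb);

                c = memcmp(na_buf, nb_buf, na < nb ? na : nb);
                if (c != 0)
                    return c < 0 ? -1 : 1;
                if (na != nb)
                    return na < nb ? -1 : 1;
            }
        }

        if (i < alen)
            return 1;           /* a has trailing bytes */
        if (j < blen)
            return -1;          /* b has trailing bytes */
        return 0;
    }

The key point is that the only per-character buffer needed is the small
stack array in the slow path; no whole-string normalization buffer is ever
allocated.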
On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:
> In the slow path you only normalize the _current character_, so you
> only need enough buffer space for that.

That's a clear win for UTF8 data. Also, if there are no changes, then
you can just return the input buffer and not bother allocating an
output buffer.

> The really nice thing about form-insensitive/form-preserving
> functionality is that it is form-preserving rather than normalizing
> on create and lookup, and that makes the fast path described above
> feasible. Whereas if you normalize on create and lookup you have to
> heap-allocate enough space for each normalized string.

Non-deterministic ICU collations are already insensitive to most
normalization differences. Some differences are not handled when it
involves too many combining marks, but you can use the "full
normalization" option to address that. See:

https://www.postgresql.org/docs/current/collation.html#ICU-COLLATION-SETTINGS-TABLE

> The other nice thing is that f-i/f-p behavior is a lot less
> surprising to users -- the input methods they use don't matter.

Postgres is already form-preserving; it does not auto-normalize. (I
have suggested that we might want to offer something like that, but
that would be a user choice.)

Currently, the non-deterministic collations (which offer
form-insensitivity) are not available at the database level, so you
have to explicitly specify the COLLATE clause on a column or query. In
other words, Postgres is not form-insensitive by default, though there
is work to make that possible.

> What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
> something very close to NFD) but input methods (even on OS X)
> invariably produce NFC (well, something close to it), at least for
> Latin scripts. This means that on OS X if you use filenames from
> directory listings those will be NFD while user inputs will be NFC,
> and so you can't just memcmp()/strcmp() those -- you have to
> normalize _or_ use form-insensitive string comparison, but nothing
> did that 20 years ago. Thus doing the form-insensitivity in the
> filesystem seemed best, and if you do that you can be form-preserving
> to enable the optimization described above.

Databases have similar concerns as a filesystem in this respect.

Regards,
	Jeff Davis
On Fri, 2025-06-20 at 17:51 +0300, Alexander Borisov wrote:
> I don't quite see how this compares to the implementation in Rust. In
> the link provided, they use a perfect hash, which I got rid of to get
> a 2x boost.
> As for the ICU implementation in C++, I have always considered it
> slow, at least when used from C code.
> I may well run benchmarks and compare the performance of the Postgres
> approach against ICU, but that is beyond the scope of the patches
> under discussion.

Are you saying that, with these patches, Postgres will offer the
fastest open-source Unicode normalization? If so, that would be very
cool.

The reason I'm asking is because, if there are multiple open source
implementations, we should either have the best one, or just borrow
another one as long as it has a suitable license (perhaps translating
to C as necessary).

Regards,
	Jeff Davis
On Fri, Jun 20, 2025 at 10:15:47AM -0700, Jeff Davis wrote:
> On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:
> > In the slow path you only normalize the _current character_, so you
> > only need enough buffer space for that.
>
> That's a clear win for UTF8 data. Also, if there are no changes, then
> you can just return the input buffer and not bother allocating an
> output buffer.

The latter is not relevant to string comparison or hashing, but, yeah,
if you have to produce a normalized string you can optimistically
assume it is already normalized and defer allocation until you know it
isn't normalized.

> Postgres is already form-preserving; it does not auto-normalize. (I
> have suggested that we might want to offer something like that, but
> that would be a user choice.)

Excellent, then I would advise looking into adding form-insensitive
string comparison and hashing to get f-i/f-p behavior.

> Currently, the non-deterministic collations (which offer
> form-insensitivity) are not available at the database level, so you
> have to explicitly specify the COLLATE clause on a column or query.
> In other words, Postgres is not form-insensitive by default, though
> there is work to make that possible.

TIL. Thanks.

> Databases have similar concerns as a filesystem in this respect.

I figured :)
20.06.2025 20:20, Jeff Davis wrote:
> On Fri, 2025-06-20 at 17:51 +0300, Alexander Borisov wrote:
>> I don't quite see how this compares to the implementation in Rust.
>> In the link provided, they use a perfect hash, which I got rid of to
>> get a 2x boost.
>> As for the ICU implementation in C++, I have always considered it
>> slow, at least when used from C code.
>> I may well run benchmarks and compare the performance of the Postgres
>> approach against ICU, but that is beyond the scope of the patches
>> under discussion.
>
> Are you saying that, with these patches, Postgres will offer the
> fastest open-source Unicode normalization? If so, that would be very
> cool.

That's what we're aiming for: to implement the fastest approach.
By applying the proposed patches (two patches) we get the fastest
codepoint lookup in the tables. This is evidenced by the measurements
made here and earlier in the patch for the Unicode case improvements.

Once these patches are accepted, I will improve the C normalization
code directly and optimize it. That's when we can run benchmarks and
say with confidence that we're the best at speed.

> The reason I'm asking is because, if there are multiple open source
> implementations, we should either have the best one, or just borrow
> another one as long as it has a suitable license (perhaps translating
> to C as necessary).

Before getting into this "fight" I studied different approaches to
searching for the necessary codepoints in tables (hash, binary search,
radix trees...) and came to the conclusion that the approach I
proposed (range index) is the fastest for this purpose.

--
Regards,
Alexander Borisov
On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:
> That's what we're aiming for: to implement the fastest approach.

Awesome! Thank you for clarifying this as a goal. Having the fastest
open-source Unicode normalization would be a great thing to highlight
when this is done.

Regards,
	Jeff Davis
07.07.2025 21:44, Alexander Borisov wrote:
> 30.06.2025 22:28, Jeff Davis wrote:
>> On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:
>>> That's what we're aiming for: to implement the fastest approach.
>>
>> Awesome! Thank you for clarifying this as a goal. Having the fastest
>> open-source Unicode normalization would be a great thing to highlight
>> when this is done.
>
> After the discussion in this thread, we are settling on the "small"
> patch option. The table size is reduced, and the speed is almost 2x.
> Attached are the final two patches. After they are reviewed/accepted,
> I will proceed to optimizing the C code for Unicode Normalization
> Forms.

Version 3 patches. In version 2, "make -s headerscheck" did not work.

--
Regards,
Alexander Borisov
Attachments
Hi,

I'm new here, so please advise me: if a patch wasn't accepted at the
commitfest, does that mean it's not needed (no one was interested in
it), or was there not enough time? Please tell me how this works.

Should I move it to the next commitfest? I'm not quite sure what to
do. I looked and saw that patches are often moved from commitfest to
commitfest. I understand that this is normal practice?

Please understand, the process is not very transparent to me, and the
right approach is not obvious. What is the best course of action for
me?

Thanks!

--
Regards,
Alexander Borisov
On Friday, August 1, 2025, Alexander Borisov <lex.borisov@gmail.com> wrote:

> I looked and saw that patches are often moved from commitfest to
> commitfest. I understand that this is normal practice?
>
> What is the best course of action for me?

If you feel the patch is committable it should remain in the non-draft
CFs, being moved every other month or so, until it gets committed.

David J.
Alexander Borisov <lex.borisov@gmail.com> writes:
> I'm new here, so please advise me: if a patch wasn't accepted at the
> commitfest, does that mean it's not needed (no one was interested in
> it), or was there not enough time?

It's kind of hard to tell really --- there are many patches in our
queue and not nearly enough reviewers. So maybe someone will get to
it in the fullness of time, or maybe it's true that no one cares
about the particular topic. (But bug fixes and performance
improvements are almost always interesting to someone.)

I recommend optimism: as long as *you* still believe that the patch
is worthwhile, keep pushing it forward to the next commitfest.
We used to do that automatically, but we have started asking authors
to do that themselves, as a way of weeding out patches for which
the author has lost interest.

			regards, tom lane
01.08.2025 23:37, Tom Lane wrote:
> Alexander Borisov <lex.borisov@gmail.com> writes:
>> I'm new here, so please advise me: if a patch wasn't accepted at the
>> commitfest, does that mean it's not needed (no one was interested in
>> it), or was there not enough time?
>
> It's kind of hard to tell really --- there are many patches in our
> queue and not nearly enough reviewers. So maybe someone will get to
> it in the fullness of time, or maybe it's true that no one cares
> about the particular topic. (But bug fixes and performance
> improvements are almost always interesting to someone.)
>
> I recommend optimism: as long as *you* still believe that the patch
> is worthwhile, keep pushing it forward to the next commitfest.
> We used to do that automatically, but we have started asking authors
> to do that themselves, as a way of weeding out patches for which
> the author has lost interest.

Thanks, Tom! I always choose optimism. I've been in open source for a
while, and this is the first time I've seen this approach.

I have a plan to further improve Postgres performance around Unicode
(and more), which is kind of the foundation for working with text. I
don't want to overwhelm the community with patches, so I take a
systematic approach.

Once again, thank you, Tom. The community's approach has become
clearer.

--
Regards,
Alexander Borisov
On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
> Version 3 patches. In version 2, "make -s headerscheck" did not work.

I ran my own performance tests. What I did was get some test data from
ICU v76.1 by doing:

  cat icu4j/perf-tests/data/collation/Test* \
    | uconv -f utf-8 -t utf-8 -x nfc > ~/strings.nfc.txt
  cat icu4j/perf-tests/data/collation/Test* \
    | uconv -f utf-8 -t utf-8 -x nfd > ~/strings.nfd.txt

  export NORM_PERF_NFC_FILE=~/strings.nfc.txt
  export NORM_PERF_NFD_FILE=~/strings.nfd.txt

The first is about 8MB, the second 9MB (because NFD is slightly
larger).

Then I added some testing code to norm_test.c. It's not intended for
committing, just to run the test. Note that it requires setting
environment variables to find the input files. If patch v3j-0001 is
applied, it's using perfect hashing. If patches v3j-0002-4 are applied,
it's using your code. In either case it compares with ICU.

Results with perfect hashing (100 iterations):

Normalization from NFC to NFD with PG:   010.009
Normalization from NFC to NFD with ICU:  001.580
Normalization from NFC to NFKD with PG:  009.376
Normalization from NFC to NFKD with ICU: 000.857
Normalization from NFD to NFC with PG:   016.026
Normalization from NFD to NFC with ICU:  001.205
Normalization from NFD to NFKC with PG:  015.903
Normalization from NFD to NFKC with ICU: 000.654

Results with your code (100 iterations):

Normalization from NFC to NFD with PG:   004.626
Normalization from NFC to NFD with ICU:  001.577
Normalization from NFC to NFKD with PG:  004.024
Normalization from NFC to NFKD with ICU: 000.861
Normalization from NFD to NFC with PG:   006.846
Normalization from NFD to NFC with ICU:  001.209
Normalization from NFD to NFKC with PG:  006.655
Normalization from NFD to NFKC with ICU: 000.651

Your patches are a major improvement, but I'm trying to figure out why
ICU still wins by so much. Thoughts? I didn't investigate much myself
yet, so it's quite possible there's a bug in my test or something.

Regards,
	Jeff Davis
Attachments
09.08.2025 02:17, Jeff Davis wrote:
> On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
>> Version 3 patches. In version 2, "make -s headerscheck" did not work.
>
> I ran my own performance tests. What I did was get some test data from
> ICU v76.1 by doing:
> [..]
> Results with perfect hashing (100 iterations):
>
> Normalization from NFC to NFD with PG:   010.009
> Normalization from NFC to NFD with ICU:  001.580
> Normalization from NFC to NFKD with PG:  009.376
> Normalization from NFC to NFKD with ICU: 000.857
> Normalization from NFD to NFC with PG:   016.026
> Normalization from NFD to NFC with ICU:  001.205
> Normalization from NFD to NFKC with PG:  015.903
> Normalization from NFD to NFKC with ICU: 000.654
>
> Results with your code (100 iterations):
>
> Normalization from NFC to NFD with PG:   004.626
> Normalization from NFC to NFD with ICU:  001.577
> Normalization from NFC to NFKD with PG:  004.024
> Normalization from NFC to NFKD with ICU: 000.861
> Normalization from NFD to NFC with PG:   006.846
> Normalization from NFD to NFC with ICU:  001.209
> Normalization from NFD to NFKC with PG:  006.655
> Normalization from NFD to NFKC with ICU: 000.651
>
> Your patches are a major improvement, but I'm trying to figure out why
> ICU still wins by so much. Thoughts? I didn't investigate much myself
> yet, so it's quite possible there's a bug in my test or something.

I experimented a bit and took a look at the ICU code.

1. The performance test is not entirely accurate. In ICU
(unorm_normalize()) we pass in the output buffer and its size. That
is, ICU does not calculate how much output data there will be; we hand
it our buffer, and ICU simply checks whether the data fits into it or
not. In our case, PG (unicode_normalize()) only accepts the input
buffer and first calculates the size of the output buffer. This means
we are doing double the work, since we traverse the input data one
extra time. It would be fairer to first call the ICU function that
calculates the required buffer size, and only then call the
normalization itself.

2. In PG, there are unnecessary table lookups (get_code_entry()) that
can clearly be avoided.

3. In PG, the algorithm is not optimized. We could eliminate the
decompose_code() function, which is called for each code point. We can
remove its recursion and prepare the data in advance, i.e. perform the
data expansion in the Perl script. If we do this, we will have code
that works in place and without recursion.

And then there are all sorts of other minor details. Of course, there
are other approaches as well.

I did this comparison (your test, Jeff):

1. Without patch.

Normalization from NFC to NFD with PG:   009.121
Normalization from NFC to NFD with ICU:  000.954
Normalization from NFC to NFKD with PG:  009.048
Normalization from NFC to NFKD with ICU: 000.965
Normalization from NFD to NFC with PG:   014.525
Normalization from NFD to NFC with ICU:  000.623
Normalization from NFD to NFKC with PG:  014.380
Normalization from NFD to NFKC with ICU: 000.666

2. With patch.

Normalization from NFC to NFD with PG:   005.743
Normalization from NFC to NFD with ICU:  001.005
Normalization from NFC to NFKD with PG:  005.902
Normalization from NFC to NFKD with ICU: 000.963
Normalization from NFD to NFC with PG:   007.837
Normalization from NFD to NFC with ICU:  000.640
Normalization from NFD to NFKC with PG:  008.036
Normalization from NFD to NFKC with ICU: 000.667

3. With the patch, but with direct access to tables: I created one
large index table (uint16_t) where there is no search at all, just
direct access to the data.

Normalization from NFC to NFD with PG:   002.792
Normalization from NFC to NFD with ICU:  000.953
Normalization from NFC to NFKD with PG:  002.865
Normalization from NFC to NFKD with ICU: 000.958
Normalization from NFD to NFC with PG:   004.361
Normalization from NFD to NFC with ICU:  000.651
Normalization from NFD to NFKC with PG:  004.474
Normalization from NFD to NFKC with ICU: 000.668

It is evident that direct access provides another 2x over the patch,
but it adds +1.5MB to the object file size. This was only to check the
difference in access time.

As a result, I would not look into ICU at the moment, especially since
we have a different approach.

I am currently working on optimizing unicode_normalize(). I am trying
to come up with an improved version of the algorithm in C by the next
commitfest (which will be in September).

P.S.: In general, I strive to surpass ICU in terms of speed. I think
we're making good progress. Let's see what happens.

--
Regards,
Alexander Borisov
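[Editorial note: a minimal sketch of what the "direct access" experiment in
item 3 looks like. One flat index array covers the whole codepoint range, so
a lookup is a single array access with no branching; the placeholder tables
here are zeroed, and the real generated ones are what cost the extra object
size reported above.]

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_CODEPOINT 0x10FFFF

    typedef struct
    {
        uint8_t  comb_class;
        uint16_t dec_index;
    } norm_entry;

    static const norm_entry norm_table[1] = {{0, 0}};             /* entry 0 = "no data" */
    static const uint16_t   norm_direct_index[MAX_CODEPOINT + 1]; /* one slot per codepoint */

    static inline const norm_entry *
    norm_lookup_direct(uint32_t cp)
    {
        if (cp > MAX_CODEPOINT)
            return NULL;
        /* No range branches: the codepoint itself indexes the table. */
        return &norm_table[norm_direct_index[cp]];
    }

The trade-off is exactly the one in the numbers above: the fastest possible
lookup, paid for with a much larger read-only table in the binary.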
On Mon, 2025-08-11 at 17:21 +0300, Alexander Borisov wrote:
> As a result, I would not look into ICU at the moment, especially
> since we have a different approach.
> I am currently working on optimizing unicode_normalize().
> I am trying to come up with an improved version of the algorithm in C
> by the next commitfest (which will be in September).

Agreed, but thank you for adding context so I can understand where we
are. The patch as proposed is a speedup, and also a simplification
because it eliminates the different code path for the frontend code.
That also makes me feel better about testing, because I don't think
both those paths were tested equally.

Comments on the patch itself:

The 0001 patch generalizes the two-step lookup process: first navigate
branches to find the index into a partially-compacted sparse array,
and then use that to get the index into the dense array. The branching
code, the sparse array, and the dense array are all generated code.
The reason for the two-step lookup is to keep the sparse array element
size small (uint16), so that missing elements consume less space
(missing elements still remain for small gaps). The full entry is kept
in the dense array.

GenerateSparseArray.pm would be more descriptive than "Ranges.pm" for
the new module. And we should probably include "sparse" in the name of
the sparse arrays.

The new module is responsible for generating the branching code as
well as the sparse array, while the caller is responsible for the
dense arrays. For case mapping, one sparse array is used for four
parallel arrays for the different case kinds
(lower/title/upper/fold).

The use of zero values for different purposes is getting confusing. It
means "doesn't exist", but we are also reserving the zeroth element in
the arrays. Would it be easier to just "#define EMPTY 0xFFFF" and then
have the caller check for it? That way we don't need to reserve the
zeroth array element, which should make it easier to avoid off-by-one
errors.

I think we can simplify the interface, as well. Why does the caller
need to separately generate the ranges, then generate the table, then
generate the branches? It's really all the same action and can be
based on an input hash with a certain structure, and then return both
the table and the branches, right?

Regards,
	Jeff Davis
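[Editorial note: to make the two "missing entry" conventions being debated
here concrete, a minimal sketch with invented names and values; the real
generated tables look different.]

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint8_t comb_class; } norm_entry;

    /* Convention A (current patch): element 0 of the dense table is
     * reserved, and a sparse-index value of 0 means "no data". */
    static const norm_entry norm_table_a[]   = {{0}, {230}};
    static const uint16_t   sparse_index_a[] = {0, 1};

    static inline const norm_entry *
    lookup_reserved_zero(size_t off)
    {
        return &norm_table_a[sparse_index_a[off]];  /* entry 0 acts as the empty entry */
    }

    /* Convention B (the suggestion above): a sentinel marks missing slots,
     * so nothing in the dense table is reserved and indexes are not shifted
     * by one. */
    #define NORM_EMPTY 0xFFFF

    static const norm_entry norm_table_b[]   = {{230}};
    static const uint16_t   sparse_index_b[] = {NORM_EMPTY, 0};

    static inline const norm_entry *
    lookup_sentinel(size_t off)
    {
        uint16_t idx = sparse_index_b[off];

        return idx == NORM_EMPTY ? NULL : &norm_table_b[idx];
    }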
14.08.2025 01:12, Jeff Davis wrote:
> On Mon, 2025-08-11 at 17:21 +0300, Alexander Borisov wrote:
[..]
>
> Comments on the patch itself:
>
> The 0001 patch generalizes the two-step lookup process: first navigate
> branches to find the index into a partially-compacted sparse array,
> and then use that to get the index into the dense array. The branching
> code, the sparse array, and the dense array are all generated code.
> The reason for the two-step lookup is to keep the sparse array element
> size small (uint16), so that missing elements consume less space
> (missing elements still remain for small gaps). The full entry is kept
> in the dense array.

Yes, that's exactly right.

> GenerateSparseArray.pm would be more descriptive than "Ranges.pm" for
> the new module. And we should probably include "sparse" in the name of
> the sparse arrays.

I don't mind, that's good.

> The new module is responsible for generating the branching code as
> well as the sparse array, while the caller is responsible for the
> dense arrays. For case mapping, one sparse array is used for four
> parallel arrays for the different case kinds
> (lower/title/upper/fold).

Yes, that's right.

> The use of zero values for different purposes is getting confusing. It
> means "doesn't exist", but we are also reserving the zeroth element in
> the arrays. Would it be easier to just "#define EMPTY 0xFFFF" and then
> have the caller check for it? That way we don't need to reserve the
> zeroth array element, which should make it easier to avoid off-by-one
> errors.

Initially, that was the case; however, I subsequently concluded that
introducing magic values to mark an element that was not found was not
an ideal solution. I felt that 0 was easier to understand.

I'm not entirely sure now, but maybe this will even let us get rid of
the comparison with 0. That is, we can get an element from the main
table by index 0: it will be zeroed, and the algorithm will work as it
should. However, it is not certain that this will have any effect on
performance. In general, I don't have a firm position on this issue.

> I think we can simplify the interface, as well. Why does the caller
> need to separately generate the ranges, then generate the table, then
> generate the branches? It's really all the same action and can be
> based on an input hash with a certain structure, and then return both
> the table and the branches, right?

Yes, it can be done in one go. But the separation was done
intentionally. It makes the code more understandable and, more
importantly, it makes it easier for developers "from the street" to
understand it and, if necessary, correct or extend the algorithm.
These are utilities, code for generating an algorithm, that will not
be used very often, and I would not "compress" it. Here, it is more
important for us to understand how it works.

--
Regards,
Alexander Borisov
Hi, Jeff, hackers!

As promised, here is the refactoring of the C code for Unicode
Normalization Forms. In general terms, here's what has changed:

1. Recursion has been removed; the expanded data is now generated by a
Perl script.
2. Memory is no longer allocated as uint32 for the entire input;
instead, a uint8 CCC cache is allocated for the entire input, which
boosts performance significantly.
3. The code for the unicode_normalize() function has been completely
rewritten.

I am confident that we have achieved excellent results.

Jeff's test:

Without patch:
Normalization from NFC to NFD with PG:  009.121
Normalization from NFC to NFKD with PG: 009.048
Normalization from NFD to NFC with PG:  014.525
Normalization from NFD to NFKC with PG: 014.380

With patch:
Normalization from NFC to NFD with PG:  001.580
Normalization from NFC to NFKD with PG: 001.634
Normalization from NFD to NFC with PG:  002.979
Normalization from NFD to NFKC with PG: 003.050

Test with ICU (with patch and ICU):
Normalization from NFC to NFD with PG:   001.580
Normalization from NFC to NFD with ICU:  001.880
Normalization from NFC to NFKD with PG:  001.634
Normalization from NFC to NFKD with ICU: 001.857
Normalization from NFD to NFC with PG:   002.979
Normalization from NFD to NFC with ICU:  001.144
Normalization from NFD to NFKC with PG:  003.050
Normalization from NFD to NFKC with ICU: 001.260

pgbench:
The files were sent via pgbench. The files contain all code points
that need to be normalized.

NFC:
Patch:   tps = 9701.568161
Without: tps = 6820.828104

NFD:
Patch:   tps = 2707.155148
Without: tps = 1745.949174

NFKC:
Patch:   tps = 9893.952804
Without: tps = 6697.358888

NFKD:
Patch:   tps = 2580.785909
Without: tps = 1521.058417

To ensure fairness when testing against ICU, I adjusted Jeff's test:
since we calculate the size of the final buffer, I made ICU do the
same. I'm talking about:

Get size:
length = unorm_normalize(u_input, -1, form, 0, NULL, 0, &status);

Normalize:
unorm_normalize(u_input, -1, form, 0, u_result, length, &status);

Otherwise, we were handing ICU a huge preallocated buffer to write
into, while we ourselves compute exactly the buffer we need.

--
Regards,
Alexander Borisov
Attachments
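[Editorial note: a toy sketch of what item 1 above ("no recursion, data
generated by a Perl script") means, also touching the earlier point about
decompose_code(). The generator expands each canonical decomposition
completely, so at run time the C code just copies a precomputed sequence.
The single entry below is hand-written (U+1E14 -> U+0112 U+0300 -> U+0045
U+0304 U+0300); the real tables are generated and laid out differently.]

    #include <stdint.h>
    #include <stdio.h>

    typedef struct
    {
        uint32_t cp;
        uint8_t  dec_size;
        uint16_t dec_index;       /* offset into the flat decomposition array */
    } decomp_entry;

    static const uint32_t decomp_chars[] = {
        0x0045, 0x0304, 0x0300,   /* fully expanded decomposition of U+1E14 */
    };

    static const decomp_entry decomp_table[] = {
        {0x1E14, 3, 0},
    };

    /* Runtime side: a lookup plus a copy, no recursive decompose_code() calls. */
    static int
    get_full_decomposition(uint32_t cp, const uint32_t **out)
    {
        for (size_t i = 0; i < sizeof(decomp_table) / sizeof(decomp_table[0]); i++)
        {
            if (decomp_table[i].cp == cp)
            {
                *out = &decomp_chars[decomp_table[i].dec_index];
                return decomp_table[i].dec_size;
            }
        }
        *out = NULL;
        return 0;
    }

    int
    main(void)
    {
        const uint32_t *seq;
        int n = get_full_decomposition(0x1E14, &seq);

        for (int i = 0; i < n; i++)
            printf("U+%04X ", (unsigned) seq[i]);
        printf("\n");
        return 0;
    }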
On Wed, Sep 3, 2025 at 09:35, Alexander Borisov <lex.borisov@gmail.com> wrote:
> Hi, Jeff, hackers!
>
> As promised, here is the refactoring of the C code for Unicode
> Normalization Forms. In general terms, here's what has changed:
>
> 1. Recursion has been removed; the expanded data is now generated by
> a Perl script.
> 2. Memory is no longer allocated as uint32 for the entire input;
> instead, a uint8 CCC cache is allocated for the entire input, which
> boosts performance significantly.
> 3. The code for the unicode_normalize() function has been completely
> rewritten.
>
> I am confident that we have achieved excellent results.
Hey.
I've looked into these patches.
Patches apply, compilation succeeds, and make check and make installcheck show
no errors.
Code quality is good, although I suggest that a native English speaker review
the comments and commit messages, as they are a bit difficult to follow.
The Sparse Array approach is described in the newly introduced
GenerateSparseArray.pm module. Perhaps it'd be valuable to add a section to
src/common/unicode/README as well; it'll get more visibility.
( Not insisting here. )
For performance testing I've used an approach by Jeff Davis. [1]
I've prepared NFC and NFD files, loaded them into UNLOGGED tables and measured
normalize() calls.
CREATE UNLOGGED TABLE strings_nfd (
str text STORAGE PLAIN NOT NULL
);
COPY strings_nfd FROM '/var/lib/postgresql/strings.nfd.txt';
CREATE UNLOGGED TABLE strings_nfc (
str text STORAGE PLAIN NOT NULL
);
COPY strings_nfc FROM '/var/lib/postgresql/strings.nfc.txt';
SELECT count( normalize( str, NFD ) ) FROM strings_nfd, generate_series( 1, 10 ) x;
SELECT count( normalize( str, NFC ) ) FROM strings_nfc, generate_series( 1, 10 ) x;
And I've got the following numbers:
Master
NFD Time: 2954.630 ms / 295ms
NFC Time: 3929.939 ms / 330ms
Patched
NFD Time: 1658.345 ms / 166ms / +78%
NFC Time: 1862.757 ms / 186ms / +77%
Overall, I find these patches and performance very nice and valuable.
I've added myself as a reviewer and marked this patch as Ready for Committer.
[1] https://postgr.es/m/adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com
Victor Yegorov
> Hey.
>
> I've looked into these patches.

Hi Victor,

Thank you for reviewing the patch and testing it!

[..]

> The Sparse Array approach is described in the newly introduced
> GenerateSparseArray.pm module. Perhaps it'd be valuable to add a
> section to src/common/unicode/README as well; it'll get more
> visibility.
> ( Not insisting here. )

It seems to me that the description of the algorithm the script
implements is best kept in the script itself. But I'm not sure either,
because I don't know what is customary in the PG community.

--
Regards,
Alexander Borisov
On Thu, 2025-09-11 at 20:51 +0300, Alexander Borisov wrote:
>
> > Hey.
> >
> > I've looked into these patches.
>
> Hi Victor,
>
> Thank you for reviewing the patch and testing it!

Heikki, do you have thoughts on this thread?

Regards,
	Jeff Davis
20.09.2025 04:03, Jeff Davis wrote:
> On Thu, 2025-09-11 at 20:51 +0300, Alexander Borisov wrote:
>>
>>> Hey.
>>>
>>> I've looked into these patches.
>>
>> Hi Victor,
>>
>> Thank you for reviewing the patch and testing it!
>
> Heikki, do you have thoughts on this thread?

Hey,

In patch v5 (attached), I changed the approach to combining class
(CCC) caching. Now, for small texts (less than 512 characters), we
don't allocate memory from the heap; we use the stack.

According to pgbench tests, we have a 2x speedup. According to Jeff's
tests, we have a 10x speedup. According to the tests from the thread
https://www.postgresql.org/message-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com,
we also have a 2x speedup.

--
Regards,
Alexander Borisov
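[Editorial note: for readers skimming the thread, a minimal sketch of the
stack-versus-heap idea described above. The 512 threshold comes from the
message; the function and variable names are illustrative, not the v5 patch
code, and the real code would use palloc/pfree rather than malloc/free.]

    #include <stdint.h>
    #include <stdlib.h>

    #define NORM_STACK_CHARS 512

    static int
    normalize_with_ccc_cache(const uint32_t *input, size_t len)
    {
        uint8_t  stack_ccc[NORM_STACK_CHARS];
        uint8_t *ccc = stack_ccc;
        int      result = 0;

        if (len > NORM_STACK_CHARS)
        {
            ccc = malloc(len);          /* only long inputs pay for an allocation */
            if (ccc == NULL)
                return -1;
        }

        /* ... fill ccc[i] once per character and run the normalization loop ... */
        (void) input;

        if (ccc != stack_ccc)
            free(ccc);
        return result;
    }

Short strings, which dominate typical workloads, thus avoid heap allocation
entirely, which matches the 2x pgbench improvement reported above.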