Re: Improve the performance of Unicode Normalization Forms.
| From | Alexander Borisov |
|---|---|
| Subject | Re: Improve the performance of Unicode Normalization Forms. |
| Date | |
| Msg-id | 58c6cea8-a5b1-4a8a-8044-18ef0011a168@gmail.com |
| In reply to | Re: Improve the performance of Unicode Normalization Forms. (Heikki Linnakangas <hlinnaka@iki.fi>) |
| List | pgsql-hackers |
12.12.2025 15:22, Heikki Linnakangas wrote:
> On 24/11/2025 11:55, Alexander Borisov wrote:
>> Hey guys, any news?
>
> I finally got around to look at this, thanks for your patience!
>
Hi Heikki,
First of all, I would like to thank you for the review.
I tried to take your comments into account and test the approaches.
About major changes/improvements
The branching kept bothering me. The algorithm consisted of a function
with branches (essentially a binary search with offsets) and a table of
indexes. I really wanted to get rid of the branches. And then it dawned
on me that instead of building branches with offsets, I could create
another table that would contain these offsets.
The "inspiration" came from my patch Optimization of the is_normalized()
function [0].
Concretely, I changed the GenerateSparseArray algorithm: it now
generates a table of offsets instead of a function with branches.
In other words, we now have two tables:
1. A table of offsets, which tells where in the index table the desired
range starts.
2. A table of indexes into the data table, divided into fixed-size ranges.
What does this look like in practice?
Let's take a fixed range of size 16 and data covering 10..20 and 100..110
(imagine these are codepoints).
1. Let's create the first table with offsets:
static const uint8 table_offset[8] =
{
	/* one offset per 16-codepoint block; 0 points at the dummy range */
	16, 32, 0, 0, 0, 0, 48, 0
};
2. Create a second table with indexes to the data table:
static const uint8 table_index[64] =
{
	/* dummy range (offset 0) */
	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	/* codepoints 0..15 (offset 16) */
	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19, 41, 152, 147, 111, 138,
	/* codepoints 16..31 (offset 32) */
	46, 50, 133, 41, 164, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	/* codepoints 96..111 (offset 48) */
	0, 0, 0, 0, 174, 47, 75, 31, 235, 112, 103, 167, 130, 11, 240, 0
};
(indexes/data within table_index are provided as an example)
The first range is allocated for a dummy index (like a NULL).
3. Lookup function:
static uint8
get_index(char32_t cp)
{
	uint32	offset_idx;
	uint8	offset;

	offset_idx = cp >> 4;
	if (offset_idx > 7)
		return 0;		/* outside the covered range */

	offset = table_offset[offset_idx];
	return table_index[offset + (cp & 15)];
}
As you can see from the example, only one branch is left: the check for
going outside the bounds.
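To illustrate, with the example tables above a few lookups resolve like
this (the values are just the example data):

get_index(12);   /* 12 >> 4 = 0, offset 16, table_index[16 + 12] = 152 */
get_index(105);  /* 105 >> 4 = 6, offset 48, table_index[48 + 9] = 112 */
get_index(50);   /* 50 >> 4 = 3, offset 0 (dummy range), result 0 */
get_index(200);  /* 200 >> 4 = 12 > 7, out of bounds, result 0 */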
By changing the size of the fixed range, we can compact either one table
or the other.
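For example, with the same toy data (if I have counted right):

range size 4:   ~28 offset entries + 32 index entries
range size 16:  8 offset entries + 64 index entries (the example above)
range size 256: 1 offset entry + 512 index entries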
You can read more about this in ./src/tools/GenerateSparseArray.pm
This is the main change to the algorithm.
Regarding recommendations from the review
> Thinking of GenerateSparseArray's API, I think what the caller really
> wants is to generate a function like:
>
> /*
> * Look up the value of 'x'.
> */
> static uint16
> lookup(uint16)
> {
> ...
> }
>
> That's how the PerfectHash module works. The tables should be
> implementation detail that the caller doesn't need to know about.
Yes, the lookup function is now generated together with the tables.
[..]
> The 'make update-unicode' and 'ninja update-unicode' targets are broken.
> Need to be updated for the removal of 'unicode_norm_hashfunc.h'.
Fixed.
> The GenerateSparseArray adds comments like "/* U+1234 */" to each
> element. That's nice but it implies that the elements are unicode code
> points. GenerateSparseArray could be used for many other things. Let's
> use "/* 0x1234 */" instead, or make it somehow configurable.
>
> The generated file is very large, over 1 MB. I guess it doesn't matter
> all that much, but perhaps we should generate a little more compact
> code. Maybe we don't need the "/* U+1234 */" comment on every line, but
> only one comment for each range, for example.
I have many places where I need to generate C tables. To make my life
easier, and perhaps the lives of others, I added a small module for
generating "pretty" tables. ./src/tools/PrettyLine.pm
After applying it, the size of the generated header file was reduced
to ~300KB.
> typedef struct
> {
> uint8 comb_class; /* combining class of character */
> uint8 dec_size_flags; /* size and flags of decomposition code list */
> uint16 dec_index; /* index into UnicodeDecomp_codepoints, or the
> * decomposition itself if DECOMP_INLINE */
> } pg_unicode_decomposition;
>
> The 'UnicodeDecomp_codepoints' array mentioned in the comment doesn't
> exist anymore.
Fixed.
> Finally, some ideas for packing the arrays more tightly. I'm not sure if
> these make any difference in practice or are worth the effort, but here
> we go:
>
> There are only 56 distinct comb_classes, and 18 distinct dec_size_flags,
> so if we added one more small lookup table for them, they could be
> packed into a single byte. That would shrink pg_unicode_decomposition
> from 4 to 3 bytes.
I tried different ways to optimize this. Building a separate index for
the Canonical Combining Class with the existing approaches produces a
table almost as large as the main one where "uint8 comb_class;"
currently lives; the data is simply too scattered. The current way of
storing comb_class is a reasonable middle ground.
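For reference, as I understand the suggestion, the packing would look
roughly like this (the names here are made up for the sketch):

typedef struct
{
	uint8	comb_class;		/* combining class of character */
	uint8	dec_size_flags;	/* size and flags of decomposition code list */
} pg_unicode_class_pair;

/* generated: one entry per distinct (comb_class, dec_size_flags) pair */
static const pg_unicode_class_pair class_pairs[N_CLASS_PAIRS] = { ... };

typedef struct
{
	uint8	class_pair_idx;	/* index into class_pairs */
	uint16	dec_index;		/* as before */
} pg_unicode_decomposition;

(and to actually get down to 3 bytes the struct would have to be packed
or split into parallel arrays, since the alignment of dec_index would
otherwise pad it back to 4 bytes).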
> static const uint16 decomp_map[33752] =
> This array consists of multiple ranges, and each range is accessed in a
> separate piece of code. We could use multiple arrays, one for each
> range, instead of one big array. Some of the ranges only store a small
> range of values, so we could use uint8 for them.
With the current approach, the two tables (offset and index) together
hold approximately 21,000 uint16 entries, roughly 12,000 fewer than
before. We could try to split the data further, but that would make the
lookup noticeably more complex.
Benchmarks
tables offset -- current algorithm, latest.
branches offset -- previous algorithm with branches in the lookup function.
without -- the original Postgres without patches.
* macOS 15.6.1 (Apple M3 Pro) (Apple clang version 17.0.0)
NFC:
Patch (tables offset): tps = 11180.259769
Patch (branches offset): tps = 11051.958331
Without: tps = 7540.085014
NFD:
Patch (tables offset): tps = 2962.334547
Patch (branches offset): tps = 2855.137283
Without: tps = 1841.182279
NFKC:
Patch (tables offset): tps = 11313.345548
Patch (branches offset): tps = 11239.604566
Without: tps = 7498.037333
NFKD:
Patch (tables offset): tps = 2828.025161
Patch (branches offset): tps = 2702.252719
Without: tps = 1556.659947
We can see that there is a difference between the branching and table
approaches. It may not be large, perhaps not even significant, but it
is there.
Summary of all patches:
1. The uint32 field that stored the Unicode codepoint was removed from
the main structure/table; it was needed by the perfect hashing, which
is now gone.
2. Thanks to the first point, we managed to get rid of duplicates and
reduce the main table from 6843 to 3957 entries.
3. Duplicates were also removed from the table of decomposition
codepoints, shrinking it from 5138 to 4931 uint32 entries.
4. Recursion has been removed; the data is now generated by a Perl
script instead.
5. The CCC cache now allocates a uint8 per character instead of a
uint32, which boosts performance significantly. For small texts (fewer
than 512 characters) we don't allocate from the heap at all; the cache
lives on the stack (a rough sketch follows after this list).
6. The backend and frontend have a single code base.
7. And most importantly, performance has increased significantly.
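Regarding point 5, a rough sketch of the idea (the names and the exact
threshold here are illustrative, the real code differs in details):

static void
normalize_example(size_t input_len)
{
	uint8	ccc_stack[512];
	uint8  *ccc_cache = ccc_stack;

	/* one uint8 per input character instead of a uint32 */
	if (input_len > sizeof(ccc_stack))
		ccc_cache = ALLOC(input_len);	/* palloc() in backend, malloc() in frontend */

	/* ... normalization work using ccc_cache ... */

	if (ccc_cache != ccc_stack)
		FREE(ccc_cache);
}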
[0] https://www.postgresql.org/message-id/387cd5a3-2600-4c3c-b236-576087ef7062%40gmail.com
--
Regards,
Alexander Borisov