Обсуждение: Re: Speed up ICU case conversion by using ucasemap_utf8To*()
On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
>
> master: ~540 ms
> Patched: ~460 ms
> glibc: ~410 ms
It looks like you are opening and closing the UCaseMap object each
time. Why not save it in pg_locale_t? That should speed it up even more
and hopefully beat libc.
Also, to support older ICU versions consistently, we need to fix up the
locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
that logic?
Regards,
Jeff Davis
On 12/20/24 8:24 PM, Jeff Davis wrote:
> On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
>> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
>> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
>>
>> master: ~540 ms
>> Patched: ~460 ms
>> glibc: ~410 ms
>
> It looks like you are opening and closing the UCaseMap object each
> time. Why not save it in pg_locale_t? That should speed it up even more
> and hopefully beat libc.
Fixed. New benchmarks are:
SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);
master: ~570 ms
Patched: ~340 ms
glibc: ~400 ms
So it does indeed seem like we got a further speedup and now are faster
than glibc.
> Also, to support older ICU versions consistently, we need to fix up the
> locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
> that logic?
Fixed.
Andreas
Вложения
Hi Andreas, On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-criticalcomments to share. In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we'reopening the `UCaseMap` here, rather than performing a "lookup" operation. In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate— `Additionally` would be a better replacement. -- Regards, Man Zeng www.openhalo.org
On 12/31/25 3:36 AM, zengman wrote: > On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor,non-critical comments to share. Thanks for trying it out! > In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we'reopening the `UCaseMap` here, rather than performing a "lookup" operation. Fixed. > In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate— `Additionally` would be a better replacement. Fixed. Andreas
Вложения
Hi, Here is a version 4 of the patch which uses the fact that we have method tables to remove one level of indirection. I am not sure the extra lines of codes are worth it but on the other hand despite 40 more lines the code became easier to read to me. What do you think? Andreas
Вложения
> Here is a version 4 of the patch which uses the fact that we have method > tables to remove one level of indirection. I am not sure the extra lines > of codes are worth it but on the other hand despite 40 more lines the > code became easier to read to me. What do you think? I don't have any major objections, but I noticed a few minor details that might need a bit more tweaking. `signficant` -> `significant` `realtively` -> `relatively` `if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))` -> `if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)` -- Regards, Man Zeng www.openhalo.org
On 1/3/26 4:05 AM, zengman wrote: > I don't have any major objections, but I noticed a few minor details that might need a bit more tweaking. > > `signficant` -> `significant` > `realtively` -> `relatively` > `if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))` -> `if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)` Fixed, thanks! Andreas
Вложения
On Sat, 2026-01-03 at 04:12 +0100, Andreas Karlsson wrote:
> On 1/3/26 4:05 AM, zengman wrote:
> > I don't have any major objections, but I noticed a few minor
> > details that might need a bit more tweaking.
> >
> > `signficant` -> `significant`
> > `realtively` -> `relatively`
> > `if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))` ->
> > `if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)`
> Fixed, thanks!
Committed, thank you!
Regards,
Jeff Davis
Hello Jeff,
07.01.2026 00:10, Jeff Davis wrote:
07.01.2026 00:10, Jeff Davis wrote:
Committed, thank you!
I've discovered that starting from c4ff35f10, the following query:
CREATE COLLATION c (provider = icu, locale = 'icu_something');
makes asan detect (maybe dubious, but still..) stack-buffer-overflow:
==21963==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffd386d4e63 at pc 0x650cd7972a76 bp 0x7ffd386d4e00 sp 0x7ffd386d45a8
...
Address 0x7ffd386d4e63 is located in stack of thread T0 at offset 67 in frame
#0 0x650cd86962ef in foldcase_options (.../usr/local/pgsql/bin/postgres+0x12322ef) (BuildId: e441a9634858193e7358e5901e7948606ff5b1b1)
This frame has 2 object(s):
[48, 52) 'status' (line 993)
[64, 67) 'lang' (line 992) <== Memory access at offset 67 overflows this variable
I use a build made with:
CC=gcc-13 CPPFLAGS="-fsanitize=address" LDFLAGS="-fsanitize=address -static-libasan" ./configure --with-icu ...
Could you please have a look?
Best regards,
Alexander