Re: Unicode normalization SQL functions

From	Peter Eisentraut
Subject	Re: Unicode normalization SQL functions
Date	January 9, 2020, 09:20:14
Msg-id	2309023a-6f69-f049-70e5-3c70b4fb9672@2ndquadrant.com
In reply to	Re: Unicode normalization SQL functions ("Daniel Verite" <daniel@manitou-mail.org>)
Responses	Re: Unicode normalization SQL functions
List	pgsql-hackers

On 2020-01-06 17:00, Daniel Verite wrote:
>     Peter Eisentraut wrote:
> 
>> Also, there is a way to optimize the "is normalized" test for common
>> cases, described in UTR #15.  For that we'll need an additional data
>> file from Unicode.  In order to simplify that, I would like my patch
>> "Add support for automatically updating Unicode derived files"
>> integrated first.
> 
> Would that explain that the NFC/NFKC normalization and "is normalized"
> check seem abnormally slow with the current patch, or should
> it be regarded independently of the other patch?

That's unrelated.
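
For reference, the quick check from UTR #15 mentioned in the quoted text
could look roughly like the sketch below.  It is only an illustration,
not code from the patch: the qc_table excerpt, the type names and
nfc_quick_check() are all made up here, and a real table would be
generated from the Unicode data files (the quick-check properties are
published in DerivedNormalizationProps.txt).  The idea is that a string
whose code points are all NFC_Quick_Check=Yes and in canonical order is
known to be normalized without running the full algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { QC_NO, QC_MAYBE, QC_YES } qc_result;

/* One row per code point that is not (ccc = 0, NFC_Quick_Check = Yes). */
typedef struct
{
    uint32_t  code;
    uint8_t   ccc;     /* canonical combining class */
    qc_result nfc_qc;  /* NFC_Quick_Check property */
} qc_entry;

/* Hypothetical four-row excerpt, for illustration only. */
static const qc_entry qc_table[] = {
    {0x0300, 230, QC_MAYBE},  /* COMBINING GRAVE ACCENT */
    {0x0301, 230, QC_MAYBE},  /* COMBINING ACUTE ACCENT */
    {0x0323, 220, QC_MAYBE},  /* COMBINING DOT BELOW */
    {0x0340, 230, QC_NO},     /* COMBINING GRAVE TONE MARK */
};

static const qc_entry *
qc_lookup(uint32_t code)
{
    /* A linear scan is fine for four rows; a full table needs an index. */
    for (size_t i = 0; i < sizeof(qc_table) / sizeof(qc_table[0]); i++)
        if (qc_table[i].code == code)
            return &qc_table[i];
    return NULL;  /* not listed: treated as ccc = 0, quick check = Yes */
}

/* QC_YES: normalized.  QC_NO: not normalized.  QC_MAYBE: run the full check. */
static qc_result
nfc_quick_check(const uint32_t *str, size_t len)
{
    uint8_t   last_ccc = 0;
    qc_result result = QC_YES;

    assert(str != NULL || len == 0);

    for (size_t i = 0; i < len; i++)
    {
        const qc_entry *e = qc_lookup(str[i]);
        uint8_t   ccc = e ? e->ccc : 0;
        qc_result qc = e ? e->nfc_qc : QC_YES;

        /* Combining marks out of canonical order: cannot be normalized. */
        if (ccc != 0 && last_ccc > ccc)
            return QC_NO;
        if (qc == QC_NO)
            return QC_NO;
        if (qc == QC_MAYBE)
            result = QC_MAYBE;
        last_ccc = ccc;
    }
    return result;
}
```

With this excerpt, a pure-ASCII string yields QC_YES after one pass,
while "e" followed by U+0301 yields QC_MAYBE and would still need the
full normalization to decide.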

> For instance, testing 10000 short ASCII strings:
> 
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfc normalized ;
>   count
> -------
>   10000
> (1 row)
> 
> Time: 2573,859 ms (00:02,574)
> 
> By comparison, the NFD/NFKD case is faster by two orders of magnitude:
> 
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfd normalized ;
>   count
> -------
>   10000
> (1 row)
> 
> Time: 29,962 ms
> 
> Although NFC/NFKC has a recomposition step that NFD/NFKD
> doesn't have, such a difference is surprising.

It's very likely that this is because the recomposition calls
recompose_code(), which does a sequential scan of UnicodeDecompMain for
each character.  To optimize that, we should probably build a bespoke
reverse mapping table that can be accessed more efficiently.
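
To make the idea concrete, here is a minimal sketch of such a reverse
mapping, under stated assumptions.  It does not reproduce the patch's
recompose_code() or the real layout of UnicodeDecompMain: decomp_table
is a hypothetical five-entry stand-in, and all names below are invented
for illustration.  The reverse table is keyed on the (first, second)
code point pair and sorted once, so each lookup is a binary search
instead of a scan over the whole decomposition table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the decomposition table: composed -> pair. */
typedef struct
{
    uint32_t composed;
    uint32_t first;
    uint32_t second;
} decomp_entry;

static const decomp_entry decomp_table[] = {
    {0x00C0, 0x0041, 0x0300},  /* A + grave */
    {0x00C1, 0x0041, 0x0301},  /* A + acute */
    {0x00C9, 0x0045, 0x0301},  /* E + acute */
    {0x00E9, 0x0065, 0x0301},  /* e + acute */
    {0x00F1, 0x006E, 0x0303},  /* n + tilde */
};

#define DECOMP_COUNT (sizeof(decomp_table) / sizeof(decomp_table[0]))

/* Reverse mapping: (first, second) packed into one key -> composed. */
typedef struct
{
    uint64_t key;
    uint32_t composed;
} recomp_entry;

static recomp_entry recomp_table[DECOMP_COUNT];

static uint64_t
pair_key(uint32_t first, uint32_t second)
{
    return ((uint64_t) first << 32) | second;
}

static int
cmp_recomp(const void *a, const void *b)
{
    uint64_t ka = ((const recomp_entry *) a)->key;
    uint64_t kb = ((const recomp_entry *) b)->key;

    return (ka > kb) - (ka < kb);
}

/* Build the reverse table once; a real build could emit it pre-sorted. */
static void
build_recomp_table(void)
{
    for (size_t i = 0; i < DECOMP_COUNT; i++)
    {
        recomp_table[i].key = pair_key(decomp_table[i].first,
                                       decomp_table[i].second);
        recomp_table[i].composed = decomp_table[i].composed;
    }
    qsort(recomp_table, DECOMP_COUNT, sizeof(recomp_entry), cmp_recomp);
}

/* Baseline: sequential scan, O(n) per lookup. */
static int
recompose_linear(uint32_t first, uint32_t second, uint32_t *result)
{
    for (size_t i = 0; i < DECOMP_COUNT; i++)
    {
        if (decomp_table[i].first == first && decomp_table[i].second == second)
        {
            *result = decomp_table[i].composed;
            return 1;
        }
    }
    return 0;
}

/* Reverse table: binary search, O(log n) per lookup. */
static int
recompose_bsearch(uint32_t first, uint32_t second, uint32_t *result)
{
    recomp_entry probe = {pair_key(first, second), 0};
    const recomp_entry *hit = bsearch(&probe, recomp_table, DECOMP_COUNT,
                                      sizeof(recomp_entry), cmp_recomp);

    if (hit == NULL)
        return 0;
    *result = hit->composed;
    return 1;
}

/* Sanity check: both lookups must agree on every table entry. */
static void
check_lookups_agree(void)
{
    for (size_t i = 0; i < DECOMP_COUNT; i++)
    {
        uint32_t a = 0, b = 0;

        assert(recompose_linear(decomp_table[i].first,
                                decomp_table[i].second, &a));
        assert(recompose_bsearch(decomp_table[i].first,
                                 decomp_table[i].second, &b));
        assert(a == b && a == decomp_table[i].composed);
    }
}
```

After build_recomp_table(), recompose_bsearch(0x0065, 0x0301, &out)
returns 1 and sets out to 0x00E9.  A perfect hash over the same keys
would be another option; binary search is used here only because it
needs nothing beyond the C library.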

> I've tried an alternative implementation based on ICU's
> unorm2_isNormalized() /unorm2_normalize() functions (which I'm
> currently adding to the icu_ext extension to be exposed in SQL).
> With these, the 4 normal forms are in the 20ms ballpark with the above
> test case, without a clear difference between composed and decomposed
> forms.

That's good feedback.

> Independently of the performance, I've compared the results
> of the ICU implementation vs this patch on large series of strings
> with all normal forms and could not find any difference.

And that too.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
