Re: Unicode normalization SQL functions
From | Peter Eisentraut |
---|---|
Subject | Re: Unicode normalization SQL functions |
Date | |
Msg-id | 2309023a-6f69-f049-70e5-3c70b4fb9672@2ndquadrant.com |
In reply to | Re: Unicode normalization SQL functions ("Daniel Verite" <daniel@manitou-mail.org>) |
Responses | Re: Unicode normalization SQL functions |
List | pgsql-hackers |
On 2020-01-06 17:00, Daniel Verite wrote:
> Peter Eisentraut wrote:
>
>> Also, there is a way to optimize the "is normalized" test for common
>> cases, described in UTR #15.  For that we'll need an additional data
>> file from Unicode.  In order to simplify that, I would like my patch
>> "Add support for automatically updating Unicode derived files"
>> integrated first.
>
> Would that explain that the NFC/NFKC normalization and "is normalized"
> check seem abnormally slow with the current patch, or should
> it be regarded independently of the other patch?

That's unrelated.

> For instance, testing 10000 short ASCII strings:
>
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfc normalized ;
>  count
> -------
>  10000
> (1 row)
>
> Time: 2573,859 ms (00:02,574)
>
> By comparison, the NFD/NFKD case is faster by two orders of magnitude:
>
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfd normalized ;
>  count
> -------
>  10000
> (1 row)
>
> Time: 29,962 ms
>
> Although NFC/NFKC has a recomposition step that NFD/NFKD
> doesn't have, such a difference is surprising.

It's very likely that this is because the recomposition calls
recompose_code() which does a sequential scan of UnicodeDecompMain for
each character.  To optimize that, we should probably build a bespoke
reverse mapping table that can be accessed more efficiently.

> I've tried an alternative implementation based on ICU's
> unorm2_isNormalized() /unorm2_normalize() functions (which I'm
> currently adding to the icu_ext extension to be exposed in SQL).
> With these, the 4 normal forms are in the 20ms ballpark with the above
> test case, without a clear difference between composed and decomposed
> forms.

That's good feedback.

> Independently of the performance, I've compared the results
> of the ICU implementation vs this patch on large series of strings
> with all normal forms and could not find any difference.

And that too.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
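
To illustrate the reverse mapping table suggested above for the recomposition step, here is a minimal sketch in C.  It is not the PostgreSQL implementation: the names recomp_entry, recomp_table, recomp_cmp and lookup_recomposition are hypothetical, and the table holds only a few hand-picked pairs, where a real one would presumably be generated from the Unicode data files.  The idea is to keep (first, second) -> composed entries sorted by the pair, so that a lookup can use bsearch() instead of a sequential scan.  Hangul syllables, which are composed arithmetically, and composition exclusions are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reverse-mapping entry: (first, second) -> composed. */
typedef struct
{
	uint32_t	first;		/* starter code point */
	uint32_t	second;		/* combining code point */
	uint32_t	composed;	/* precomposed code point */
} recomp_entry;

/*
 * Tiny illustrative table.  It MUST stay sorted by (first, second),
 * because the lookup below relies on binary search.
 */
static const recomp_entry recomp_table[] = {
	{0x0041, 0x0300, 0x00C0},	/* A + grave     -> U+00C0 */
	{0x0041, 0x0301, 0x00C1},	/* A + acute     -> U+00C1 */
	{0x0041, 0x0308, 0x00C4},	/* A + diaeresis -> U+00C4 */
	{0x0045, 0x0301, 0x00C9},	/* E + acute     -> U+00C9 */
	{0x0061, 0x0301, 0x00E1},	/* a + acute     -> U+00E1 */
	{0x0065, 0x0301, 0x00E9},	/* e + acute     -> U+00E9 */
	{0x006F, 0x0308, 0x00F6},	/* o + diaeresis -> U+00F6 */
};

/* bsearch() comparator: order entries by first, then by second. */
static int
recomp_cmp(const void *key, const void *elem)
{
	const recomp_entry *k = key;
	const recomp_entry *e = elem;

	if (k->first != e->first)
		return (k->first < e->first) ? -1 : 1;
	if (k->second != e->second)
		return (k->second < e->second) ? -1 : 1;
	return 0;
}

/*
 * Look up the composition of (first, second) in O(log n).
 * Returns true and sets *result if the pair composes, else false.
 */
static bool
lookup_recomposition(uint32_t first, uint32_t second, uint32_t *result)
{
	recomp_entry key = {first, second, 0};
	const recomp_entry *hit;

	hit = bsearch(&key, recomp_table,
				  sizeof(recomp_table) / sizeof(recomp_table[0]),
				  sizeof(recomp_entry), recomp_cmp);
	if (hit == NULL)
		return false;
	*result = hit->composed;
	return true;
}
```

With this table, lookup_recomposition(0x0065, 0x0301, &out) returns true and stores 0x00E9 in out, while a pair that is not in the table, such as (0x0042, 0x0301), returns false.  Binary search over a sorted array is only one way to get a faster lookup; a hash table keyed on the pair would be another.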