Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
From | Andres Freund |
---|---|
Subject | Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version |
Date | |
Msg-id | 200911081700.53726.andres@anarazel.de |
In response to | [PATCH] tsearch parser inefficiency if text includes urls or emails (Andres Freund <andres@anarazel.de>) |
Responses |
Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
Re: tsearch parser inefficiency if text includes urls or emails - new version |
List | pgsql-hackers |
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> While playing around with and evaluating tsearch I noticed that to_tsvector
> is obscenely slow for some files. After some profiling I found that this is
> due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c.
> If a multibyte encoding is in use, TParserInit copies the whole remaining
> input and converts it to wchar_t or pg_wchar - for every email or
> protocol-prefixed URL in the document. Which obviously is bad.
>
> I solved the issue by having a separate TParserCopyInit/TParserCopyClose
> which reuses the already converted strings of the original TParser -
> only at different offsets.
>
> Another approach would be to get rid of the separate parser invocations -
> requiring a bunch of additional states. This seemed more complex to me, so
> I wanted to get some feedback first.
>
> Without patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE
> filename = '/usr/share/doc/libdrm-nouveau1/changelog';
>
> ──────────────────────────────────────────────────────────────────────────
> ─────────────────────────── ...
> (1 row)
>
> Time: 5835.676 ms
>
> With patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE
> filename = '/usr/share/doc/libdrm-nouveau1/changelog';
>
> ──────────────────────────────────────────────────────────────────────────
> ─────────────────────────── ...
> (1 row)
>
> Time: 395.341 ms
>
> I'll clean up the patch if it seems like a sensible solution...

As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version.

Does anyone care to comment on whether I am the only one who thinks this is
an issue?

Andres
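
For readers without the attachment at hand, the idea of reusing the converted buffer at a different offset can be sketched roughly as below. This is not the code from the patch: SketchParser and every sketch_* name are hypothetical, the standard mbstowcs() stands in for whatever conversion wparser_def.c really performs, and the "sub-parser" merely scans to the next space. The only point is that the main parser converts the input once, and each nested parser borrows that buffer instead of converting the remaining input again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical, simplified parser state; not the struct from wparser_def.c. */
typedef struct SketchParser
{
    wchar_t    *wstr;       /* input converted to wide characters */
    size_t      wlen;       /* number of wide characters in wstr */
    size_t      pos;        /* current parse position, in characters */
    bool        owns_wstr;  /* only the converting parser may free wstr */
} SketchParser;

/* Counts full conversions so the effect of sharing is observable. */
int sketch_conversions = 0;

/* Full init: converts the whole input once. Returns NULL on failure. */
SketchParser *
sketch_init(const char *str)
{
    size_t        len = strlen(str);
    SketchParser *p = malloc(sizeof(SketchParser));

    if (p == NULL)
        return NULL;
    p->wstr = malloc((len + 1) * sizeof(wchar_t));
    if (p->wstr == NULL)
    {
        free(p);
        return NULL;
    }
    p->wlen = mbstowcs(p->wstr, str, len + 1);
    if (p->wlen == (size_t) -1)
    {
        free(p->wstr);
        free(p);
        return NULL;
    }
    p->pos = 0;
    p->owns_wstr = true;
    sketch_conversions++;
    return p;
}

/* Copy init: borrows the converted buffer, starting at the parent's offset. */
SketchParser *
sketch_copy_init(const SketchParser *orig)
{
    SketchParser *p;

    assert(orig != NULL);
    p = malloc(sizeof(SketchParser));
    if (p == NULL)
        return NULL;
    p->wstr = orig->wstr;
    p->wlen = orig->wlen;
    p->pos = orig->pos;
    p->owns_wstr = false;   /* the buffer still belongs to the parent */
    return p;
}

/* Close: frees the shared buffer only if this parser did the conversion. */
void
sketch_close(SketchParser *p)
{
    if (p == NULL)
        return;
    if (p->owns_wstr)
        free(p->wstr);
    free(p);
}

/*
 * Demo: start a nested parser at every '@' and let it scan ahead to the
 * next space. Returns the number of nested parsers used, or -1 on failure.
 */
int
sketch_demo(const char *text)
{
    SketchParser *prs = sketch_init(text);
    int           subparsers = 0;

    if (prs == NULL)
        return -1;
    for (prs->pos = 0; prs->pos < prs->wlen; prs->pos++)
    {
        if (prs->wstr[prs->pos] == L'@')
        {
            SketchParser *sub = sketch_copy_init(prs);

            if (sub == NULL)
                continue;
            /* The nested parser advances its own position only. */
            while (sub->pos < sub->wlen && sub->wstr[sub->pos] != L' ')
                sub->pos++;
            subparsers++;
            sketch_close(sub);
        }
    }
    sketch_close(prs);
    return subparsers;
}
```

In this sketch the conversion count stays at one no matter how many nested parsers are started, whereas calling sketch_init() on the remaining input for each nested parser would convert the tail of the document every time.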