[PATCH] tsearch parser inefficiency if text includes urls or emails
From | Andres Freund |
---|---|
Subject | [PATCH] tsearch parser inefficiency if text includes urls or emails |
Date | |
Msg-id | 200911011619.44683.andres@anarazel.de |
Responses |
Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
|
List | pgsql-hackers |
Hi,

While playing around with and evaluating tsearch I noticed that to_tsvector is obscenely slow for some files. After some profiling I found that this is due to the use of a separate TParser in p_ishost/p_isURLPath in wparser_def.c. If a multibyte encoding is in use, TParserInit copies the whole remaining input and converts it to wchar_t or pg_wchar - for every email or protocol-prefixed URL in the document. Which obviously is bad.

I solved the issue by adding a separate TParserCopyInit/TParserCopyClose, which reuses the already converted strings of the original TParser - only at different offsets.

Another approach would be to get rid of the separate parser invocations, at the cost of a bunch of additional states. This seemed more complex to me, so I wanted to get some feedback first.

Without patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename = '/usr/share/doc/libdrm-nouveau1/changelog';
─────────────────────────────────────────────────────────────────────────────────────────────────────
...
(1 row)

Time: 5835.676 ms

With patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename = '/usr/share/doc/libdrm-nouveau1/changelog';
─────────────────────────────────────────────────────────────────────────────────────────────────────
...
(1 row)

Time: 395.341 ms

I'll clean up the patch if it seems like a sensible solution...

Is this backpatch-worthy?

Andres

PS: I left the additional define in for the moment so that it's easier to see the performance differences.
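To illustrate the idea behind TParserCopyInit/TParserCopyClose, here is a minimal standalone sketch. It is not the code from the patch: SketchParser and the sketch_* functions are hypothetical names invented for this example, and it assumes single-byte (ASCII) input so that byte offsets and character offsets coincide, which the real parser cannot assume. The counter only exists to make the conversion cost visible.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical, simplified stand-in for the parser state (not the real TParser). */
typedef struct SketchParser
{
    const char *str;        /* input bytes */
    wchar_t    *wstr;       /* widened copy of the input */
    size_t      len;        /* number of characters in str/wstr */
    size_t      pos;        /* current position within the input */
    int         owns_wstr;  /* 1 if this parser allocated wstr and must free it */
} SketchParser;

/* Total characters converted so far, to make the cost of each init visible. */
static size_t sketch_chars_converted = 0;

/* Full init: allocates and converts the whole input, O(len) work per call. */
static SketchParser *
sketch_parser_init(const char *str, size_t len)
{
    SketchParser *p = malloc(sizeof(SketchParser));
    size_t      i;

    if (p == NULL)
        return NULL;
    p->wstr = malloc((len + 1) * sizeof(wchar_t));
    if (p->wstr == NULL)
    {
        free(p);
        return NULL;
    }
    /* Assumption: ASCII-only input, so each byte widens to one character. */
    for (i = 0; i < len; i++)
        p->wstr[i] = (wchar_t) (unsigned char) str[i];
    p->wstr[len] = L'\0';
    sketch_chars_converted += len;

    p->str = str;
    p->len = len;
    p->pos = 0;
    p->owns_wstr = 1;
    return p;
}

/* Copy init: shares the already converted buffer at the current offset, O(1). */
static SketchParser *
sketch_parser_copy_init(const SketchParser *orig)
{
    SketchParser *p;

    assert(orig != NULL && orig->pos <= orig->len);
    p = malloc(sizeof(SketchParser));
    if (p == NULL)
        return NULL;
    p->str = orig->str + orig->pos;
    p->wstr = orig->wstr + orig->pos;   /* no allocation, no conversion */
    p->len = orig->len - orig->pos;
    p->pos = 0;
    p->owns_wstr = 0;                   /* buffer belongs to the original */
    return p;
}

/* Close: frees the converted buffer only if this parser owns it. */
static void
sketch_parser_close(SketchParser *p)
{
    if (p == NULL)
        return;
    if (p->owns_wstr)
        free(p->wstr);
    free(p);
}
```

In this sketch a sub-parser created with sketch_parser_copy_init must be closed before the parser it was copied from, since it only borrows the converted buffer. With a full init per URL or email the total conversion work grows roughly with (number of URLs) x (remaining input length); with the copy variant the input is converted once.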