Re: WIP: index support for regexp search
From | Heikki Linnakangas |
---|---|
Subject | Re: WIP: index support for regexp search |
Date | |
Msg-id | 4F187D5C.30701@enterprisedb.com |
In response to | WIP: index support for regexp search (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: WIP: index support for regexp search
Re: WIP: index support for regexp search |
List | pgsql-hackers |
On 22.11.2011 21:38, Alexander Korotkov wrote:
> WIP patch with index support for regexp search for pg_trgm contrib is
> attached.
> Unlike techniques which extract continuous text parts from the regexp,
> this patch presents a technique of automaton transformation. That allows
> more comprehensive trigram extraction.

Nice!

> The current version of the patch has some limitations:
> 1) The algorithm for extracting a logical expression on trigrams has high
> computational complexity, so it can become really slow on regexps with many
> branches. Improvements to this algorithm are probably possible.
> 2) Naturally, there is no performance benefit if no trigrams can be
> extracted from the regexp. That is inevitable.
> 3) Currently, only the GIN index is supported. There are no serious
> problems; the GiST code for it is just not written yet.
> 4) There appears to be some kind of problem extracting a multibyte encoded
> character from pg_wchar. I've posted a question about it here:
> http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
> Meanwhile, I've hardcoded a dirty workaround. So
> PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
> supported yet.

This is pretty far from being in a committable state, so I'm going to mark
this as "returned with feedback" in the commitfest app. The feedback:

The code badly needs comments. There is no explanation of how the trigram
extraction code in trgm_regexp.c works. Guessing from the variable names, it
seems to be some sort of coloring algorithm that works on a graph, but that
all needs to be explained. Can this algorithm be found somewhere in the
literature, perhaps? A link to a paper would be nice.

Apart from that, the multibyte issue seems like the big one. Any way around
that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
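For readers unfamiliar with the idea under discussion, here is a minimal Python sketch of the simpler "continuous text parts" approach that the quoted message contrasts with the patch. It is not the automaton-based algorithm in trgm_regexp.c, which this thread does not describe. The function names (`doc_trigrams`, `required_trigrams`, `search`) are hypothetical, the literal runs of the regexp are assumed to be supplied by the caller rather than parsed from the pattern, and each literal is assumed to be a single run of ASCII alphanumerics. Only the padding convention (two leading spaces, one trailing space, lowercased words) follows pg_trgm.

```python
import re


def doc_trigrams(text):
    """Trigrams of a document, pg_trgm style: lowercase the text, split it
    into alphanumeric words, and pad each word with two leading spaces and
    one trailing space before taking 3-character windows."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def required_trigrams(literal):
    """Trigrams that every string containing `literal` must have.
    No padding is applied, because the literal may occur mid-word.
    Assumption: `literal` is one run of ASCII alphanumerics."""
    lit = literal.lower()
    return {lit[i:i + 3] for i in range(len(lit) - 2)}


def search(pattern, literals, docs):
    """Filter-then-recheck: keep documents whose trigram set contains every
    required trigram, then run the real regexp on those candidates only.
    `literals` holds the literal runs of `pattern`, given by the caller."""
    needed = set()
    for lit in literals:
        needed |= required_trigrams(lit)
    # With no extractable trigrams, `needed` is empty and every document
    # remains a candidate, so the filter gives no benefit.
    candidates = [d for d in docs if needed <= doc_trigrams(d)]
    return [d for d in candidates if re.search(pattern, d)]
```

For example, `search(r"index", ["index"], docs)` only runs the regexp on documents that contain the trigrams `ind`, `nde` and `dex`. A real index would look candidates up by trigram instead of scanning every document; the linear scan here only keeps the sketch short.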