Re: processing urls with tsearch2
From | Oleg Bartunov |
---|---|
Subject | Re: processing urls with tsearch2 |
Date | |
Message-ID | Pine.LNX.4.64.0709132258040.2767@sn.sai.msu.ru |
In reply to | processing urls with tsearch2 ("Laimonas Simutis" <laimis@gmail.com>) |
Responses | Re: processing urls with tsearch2 |
List | pgsql-general |
On Thu, 13 Sep 2007, Laimonas Simutis wrote:

> Hey guys,
>
> maybe anyone using tsearch2 could advise on this. With the default
> installation, url, host and some other tokens are processed with the simple
> dictionary. Thus term like mywebsite.com gets stored as 'mywebsite.com'. The
> parser correctly assigns token id of type host to the term, but then the
> dictionary the term gets routed through is simple and what gets stored is
> mywebsite.com
>
> The questions are:
>
> 1) is there a dictionary available that I could utilize that will remove
> .com, .net, .org, etc? I could write one myself, but after seeing some
> sample dictionary implementations and C code I try to avoid, I got scared a
> bit.

Yes, we have dict_regex, which was developed by Sergey Karpov; see details at
http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html
It uses the PCRE library, and you need to know Perl regexps.

> 2) has anyone else dealt with this maybe in a different way?

Sure, preprocess the text in your preferred language before passing it to to_tsvector.

> Thanks for any suggestions and help,
>
> Laimis

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
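
The second suggestion, preprocessing the text before it reaches to_tsvector, might look roughly like the sketch below. It is only an illustration, not code from this thread: the function name `strip_tlds`, the suffix list, and the regular expression are all assumptions, and a real deployment would likely need a longer suffix list and handling for multi-part suffixes such as `.com.au`.

```python
import re

# Hypothetical suffix list taken from the question (.com, .net, .org);
# extend it to cover whatever top-level domains appear in your data.
TLD_SUFFIXES = ("com", "net", "org")

# A host-like token: dot-separated labels followed by one of the suffixes.
# The trailing \b keeps words such as "foo.community" from matching.
_HOST_RE = re.compile(
    r"\b((?:[a-z0-9-]+\.)*[a-z0-9-]+)\.(?:%s)\b" % "|".join(TLD_SUFFIXES),
    re.IGNORECASE,
)


def strip_tlds(text: str) -> str:
    """Drop a trailing .com/.net/.org from host-like tokens in text."""
    return _HOST_RE.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_tlds("see mywebsite.com for details"))
```

The cleaned string would then be handed to to_tsvector, for example as a bound query parameter, so that the parser sees `mywebsite` rather than `mywebsite.com`.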