Re: HTML tags and tsearch2
From | Oleg Bartunov |
---|---|
Subject | Re: HTML tags and tsearch2 |
Date | |
Msg-id | Pine.LNX.4.64.0806261602120.11363@sn.sai.msu.ru |
In reply to | HTML tags and tsearch2 (Joanna Sharman <Joanna.Sharman@ed.ac.uk>) |
List | pgsql-general |
On Thu, 26 Jun 2008, Joanna Sharman wrote:

> Hi,
>
> I have recently started experimenting with tsearch2 and it seems that the
> default behaviour is to ignore HTML tags and treat them as word-separators.
> What I would like it to do is to ignore HTML tags within words, but instead
> of creating separate words, combine the characters separated by the tag into
> one word.
>
> For example: in the database I have words like 'K<sub>ir</sub>' that need to
> be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML
> tags are ignored and two words are stored in the vector, 'k' and 'ir'. I
> would like only one word, 'kir', to be stored in the vector, so that searches
> using the word 'kir' will match the row.

Two options: write an HTML parser, or preprocess the text before to_tsvector.

> A second, related question is whether it is possible to cause tsearch2 to
> split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.

You can write your own dictionary or use dict_regex from
http://vo.astronet.ru/arxiv/dict_regex.html

> I am not sure if this functionality is possible to implement using tsearch2
> or if there might be a better way, so I would be grateful for any advice or
> pointers to further reading on how I might do this. (I am using PostgreSQL
> version 8.1.10)

Think about upgrading to 8.3.

> Many thanks in advance,
> Joanna

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
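
For illustration, here is a minimal sketch of the "preprocess the text before to_tsvector" idea, done on the client side in Python. It is not part of tsearch2 or dict_regex. The function names, the INLINE_TAGS set, and the regular expressions are assumptions made for this example. Tags that usually occur inside a word (such as sub and sup) are deleted so that 'K<sub>ir</sub>' becomes 'Kir'; any other tag is replaced by a space so that neighbouring words stay separate. A second helper shows one possible way to split a token such as 'TM8' at letter/digit boundaries before indexing.

```python
import re

# Hypothetical list of tags assumed to appear inside a single word.
INLINE_TAGS = {"sub", "sup", "i", "b", "em", "strong", "span"}

# Matches a simple opening or closing tag and captures the tag name.
# This is a rough pattern for illustration, not a full HTML parser.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_tags(text):
    """Delete inline tags outright; turn every other tag into a space."""
    def replace(match):
        return "" if match.group(1).lower() in INLINE_TAGS else " "
    return TAG_RE.sub(replace, text)


def split_digits(text):
    """Insert a space at each boundary between an ASCII letter and a digit."""
    return re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", text)


if __name__ == "__main__":
    print(strip_tags("K<sub>ir</sub> channel"))
    print(split_digits("TM8"))
```

The cleaned string would then be what is passed to to_tsvector, for example as a query parameter, or stored in a separate column that the index is built on. A regex-based stripper like this will mishandle comments, scripts, and malformed markup, so a real HTML parser is the safer choice for arbitrary input.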