Re: text search: restricting the number of parsed words in headline generation

From	Bruce Momjian
Subject	Re: text search: restricting the number of parsed words in headline generation
Date	August 6, 2014 15:53:41
Msg-id	20140806155329.GL13302@momjian.us
In response to	Re: text search: restricting the number of parsed words in headline generation (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

FYI, I have kept this email from 2011 about poor performance of parsed
words in headline generation.  If someone wants to research it, please
do so:
http://www.postgresql.org/message-id/1314117620.3700.12.camel@dragflick

---------------------------------------------------------------------------

On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> Doesn't this force the headline to be taken from the first N words of
> >> the document, independent of where the match was?  That seems rather
> >> unworkable, or at least unhelpful.
> 
> > In headline generation function, we don't have any index or knowledge of
> > where the match is. We discover the matches by first tokenizing and then
> > comparing the matches with the query tokens. So it is hard to do
> > anything better than first N words.
> 
> After looking at the code in wparser_def.c a bit more, I wonder whether
> this patch is doing what you think it is.  Did you do any profiling to
> confirm that tokenization is where the cost is?  Because it looks to me
> like the match searching in hlCover() is at least O(N^2) in the number
> of tokens in the document, which means it's probably the dominant cost
> for any long document.  I suspect that your patch helps not so much
> because it saves tokenization costs as because it bounds the amount of
> effort spent in hlCover().
> 
> I haven't tried to do anything about this, but I wonder whether it
> wouldn't be possible to eliminate the quadratic blowup by saving more
> state across the repeated calls to hlCover().  At the very least, it
> shouldn't be necessary to find the last query-token occurrence in the
> document from scratch on each and every call.
> 
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document?  How is that helpful?
> 
>             regards, tom lane
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
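
For anyone researching this, the point above about saving state across repeated calls can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in wparser_def.c: the document is modeled as a plain list of string tokens, and the names last_match_by_rescan and MatchIndex are hypothetical. The first function rescans the whole document on every call, which is the pattern that becomes quadratic when it is called once per candidate cover. The class does one pass up front and answers later lookups from the cached positions.

```python
from bisect import bisect_left


def last_match_by_rescan(tokens, query_terms):
    """Scan every token and return the index of the last query-term match, or -1.

    Costs O(N) per call, so calling it once per cover search is O(N^2) overall.
    """
    last = -1
    for i, tok in enumerate(tokens):
        if tok in query_terms:
            last = i
    return last


class MatchIndex:
    """Hypothetical per-document state: positions of query-term matches, built once."""

    def __init__(self, tokens, query_terms):
        terms = frozenset(query_terms)
        # Single O(N) pass; the resulting list is sorted by construction.
        self.positions = [i for i, tok in enumerate(tokens) if tok in terms]

    def last_match(self):
        """Index of the last match in the document, or -1 (O(1), no rescan)."""
        return self.positions[-1] if self.positions else -1

    def next_match(self, start):
        """Index of the first match at or after `start`, or -1 (binary search)."""
        k = bisect_left(self.positions, start)
        return self.positions[k] if k < len(self.positions) else -1


tokens = "the quick brown fox jumps over the lazy dog near the fox".split()
query = {"fox", "dog"}
index = MatchIndex(tokens, query)
print(last_match_by_rescan(tokens, query), index.last_match(), index.next_match(4))
```

Both approaches agree on where the last match is. The difference is only in how much work each subsequent lookup repeats. Whether the same restructuring applies cleanly to hlCover() would need to be checked against the actual code and confirmed by profiling.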

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
