Re: text search: restricting the number of parsed words in headline generation
From | Bruce Momjian |
---|---|
Subject | Re: text search: restricting the number of parsed words in headline generation |
Date | |
Msg-id | 20140806155329.GL13302@momjian.us |
In reply to | Re: text search: restricting the number of parsed words in headline generation (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
FYI, I have kept this email from 2011 about poor performance of parsed
words in headline generation.  If someone wants to research it, please
do so:

	http://www.postgresql.org/message-id/1314117620.3700.12.camel@dragflick

---------------------------------------------------------------------------

On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> Doesn't this force the headline to be taken from the first N words of
> >> the document, independent of where the match was?  That seems rather
> >> unworkable, or at least unhelpful.
>
> > In headline generation function, we don't have any index or knowledge of
> > where the match is. We discover the matches by first tokenizing and then
> > comparing the matches with the query tokens. So it is hard to do
> > anything better than first N words.
>
> After looking at the code in wparser_def.c a bit more, I wonder whether
> this patch is doing what you think it is.  Did you do any profiling to
> confirm that tokenization is where the cost is?  Because it looks to me
> like the match searching in hlCover() is at least O(N^2) in the number
> of tokens in the document, which means it's probably the dominant cost
> for any long document.  I suspect that your patch helps not so much
> because it saves tokenization costs as because it bounds the amount of
> effort spent in hlCover().
>
> I haven't tried to do anything about this, but I wonder whether it
> wouldn't be possible to eliminate the quadratic blowup by saving more
> state across the repeated calls to hlCover().  At the very least, it
> shouldn't be necessary to find the last query-token occurrence in the
> document from scratch on each and every call.
>
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document?  How is that helpful?
>
> 			regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
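
For anyone picking this up, here is a toy model of the cost pattern Tom describes in the quoted message. It is a sketch under stated assumptions, not the code in wparser_def.c: the function names, the token-list representation, and the inspection counter are all hypothetical, and it has not been checked against what hlCover() really does. It only illustrates why rescanning the document for the last query-token occurrence on every call is quadratic, and how keeping that one piece of state across calls makes it linear.

```python
# Toy model only: this is NOT PostgreSQL's hlCover(); all names are hypothetical.

def last_query_pos(tokens, query, counter):
    """Scan the whole document for the last position holding a query token."""
    last = -1
    for i, tok in enumerate(tokens):
        counter[0] += 1              # count one token inspection
        if tok in query:
            last = i
    return last


def covers_rescanning(tokens, query):
    """One cover search per query-token occurrence, recomputing the last
    occurrence from scratch each time (the pattern suspected in the email)."""
    counter = [0]
    covers = []
    for start, tok in enumerate(tokens):
        counter[0] += 1
        if tok in query:
            end = last_query_pos(tokens, query, counter)
            covers.append((start, end))
    return covers, counter[0]


def covers_cached(tokens, query):
    """Same covers, but the last occurrence is found once and reused."""
    counter = [0]
    end = last_query_pos(tokens, query, counter)
    covers = []
    for start, tok in enumerate(tokens):
        counter[0] += 1
        if tok in query:
            covers.append((start, end))
    return covers, counter[0]


if __name__ == "__main__":
    doc = ["a", "x", "b", "x", "a", "x"]
    print(covers_rescanning(doc, {"a", "b"}))
    print(covers_cached(doc, {"a", "b"}))
```

With n tokens and m query-token occurrences, the rescanning variant inspects n + m*n tokens and the cached variant inspects 2n. In this model every cover ends at the same position, which is the behavior Tom questions in his last paragraph. Whether the real function behaves this way is exactly what would need to be confirmed by reading and profiling the actual code.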