Re: BUG #16235: ts_rank ignores match and only considers lower weighted vector
From | Dominik Giger |
---|---|
Subject | Re: BUG #16235: ts_rank ignores match and only considers lower weighted vector |
Date | |
Msg-id | CAGFNN0Y1KP_tjeAvaHqYr6fR3kEngbQeAyFaj7wF+1NaUEUAqw@mail.gmail.com |
In response to | Re: BUG #16235: ts_rank ignores match and only considers lower weighted vector (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-bugs |
On Mon, Jan 27, 2020 at 11:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > The following query shows the problem:
>
> > select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as
> > rank_correct
> > from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
> > setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
> > setweight(to_tsvector('simple', 'foo something'), 'A') as
> > doc2,
> > to_tsquery('simple', 'foo:* & something') as
> > query) as subquery;
>
> > ts_rank on doc1 is only half of the rank of doc2. ts_rank seems to only
> > consider the 'foobar' term with lower weight when calculating the rank. The
> > foo:1A is only considered in doc2.
>
> No, that's not correct. What it actually is doing is taking some sort of
> average of the weights of the occurrences, as you can see if you play
> around with a few more examples besides these two. That could be better
> documented, perhaps, but I don't think it's obviously broken.
>
> I can see that there might be a use for taking the max or even the sum
> of the weights rather than an average --- in many situations it wouldn't
> be desirable to rank doc1 of your example lower than doc2. But really
> that'd be a different ranking algorithm, not a bug fix for this one.
>
> The manual claims you can write your own ranking algorithm ... but
> AFAICS you'd have to code it in C, because we aren't exposing anything
> at SQL level that would let you get at the raw match data :-(.
> So there's room for improvement there.
>
> Also, you might try using ts_rank_cd() instead, as that uses a different
> algorithm for combining the weights. At least on this example, doc1
> gets a higher score than doc2.
>
> regards, tom lane

I see, thank you for the explanation.

Maybe I can add another reason why I think it might be a bug.
Consider this query:

select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as rank_correct
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
             to_tsquery('simple', 'foo:*') as query) as subquery;

Here I only removed the '& something' part of the query. Now the query
behaves as one would expect: the first rank is higher than the second.

I am unsure why adding a second search term (which is contained in both
documents) would lead to a change in the ranking order.

What do you think?

Regards,
Dominik Giger
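
For anyone wanting to reproduce the comparison discussed in this thread, here is a minimal sketch that runs both queries against both documents and reports ts_rank() next to ts_rank_cd(), the alternative suggested in the quoted reply. The names docs, queries, doc_name, query_text, rank_std and rank_cd are illustrative and do not come from the thread. No expected output is shown, since the values are best checked on your own server.

```sql
-- Compare ts_rank and ts_rank_cd for both documents and both queries.
-- All CTE names and column aliases here are illustrative assumptions.
WITH docs(doc_name, doc) AS (
    VALUES
        ('doc1', setweight(to_tsvector('simple', 'foo something'), 'A') ||
                 setweight(to_tsvector('simple', 'foobar'), 'C')),
        ('doc2', setweight(to_tsvector('simple', 'foo something'), 'A'))
),
queries(query_text, query) AS (
    VALUES
        ('foo:* & something', to_tsquery('simple', 'foo:* & something')),
        ('foo:*',             to_tsquery('simple', 'foo:*'))
)
SELECT d.doc_name,
       q.query_text,
       d.doc,                                  -- shows lexemes, positions and weights
       ts_rank(d.doc, q.query)    AS rank_std, -- ranking function from the bug report
       ts_rank_cd(d.doc, q.query) AS rank_cd   -- alternative suggested in the reply
FROM docs d
CROSS JOIN queries q
ORDER BY q.query_text, d.doc_name;
```

Listing the tsvector itself in the output makes it easier to see which weighted lexemes the prefix term foo:* can match in each document.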