Re: BUG #16235: ts_rank ignores match and only considers lowerweighted vector

From	Dominik Giger
Subject	Re: BUG #16235: ts_rank ignores match and only considers lowerweighted vector
Date	January 28, 2020, 10:50:20
Msg-id	CAGFNN0Y1KP_tjeAvaHqYr6fR3kEngbQeAyFaj7wF+1NaUEUAqw@mail.gmail.com
In reply to	Re: BUG #16235: ts_rank ignores match and only considers lower weighted vector (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-bugs

On Mon, Jan 27, 2020 at 11:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > The following query shows the problem:
>
> > select ts_rank(doc1, query) as rank_wrong,
> >        ts_rank(doc2, query) as rank_correct
> > from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
> >              setweight(to_tsvector('simple', 'foobar'), 'C')        as doc1,
> >              setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
> >              to_tsquery('simple', 'foo:* & something')              as query) as subquery;
>
> > ts_rank on doc1 is only half of the rank of doc2. ts_rank seems to only
> > consider the 'foobar' term with lower weight when calculating the rank. The
> > foo:1A is only considered in doc2.
>
> No, that's not correct.  What it actually is doing is taking some sort of
> average of the weights of the occurrences, as you can see if you play
> around with a few more examples besides these two.  That could be better
> documented, perhaps, but I don't think it's obviously broken.
>
> I can see that there might be a use for taking the max or even the sum
> of the weights rather than an average --- in many situations it wouldn't
> be desirable to rank doc1 of your example lower than doc2.  But really
> that'd be a different ranking algorithm, not a bug fix for this one.
>
> The manual claims you can write your own ranking algorithm ... but
> AFAICS you'd have to code it in C, because we aren't exposing anything
> at SQL level that would let you get at the raw match data :-(.
> So there's room for improvement there.
>
> Also, you might try using ts_rank_cd() instead, as that uses a different
> algorithm for combining the weights.  At least on this example, doc1
> gets a higher score than doc2.
>
>                         regards, tom lane

I see, thank you for the explanation.

Maybe I can add another reason why I think it might be a bug. Consider
this query:

select ts_rank(doc1, query) as rank_wrong,
       ts_rank(doc2, query) as rank_correct
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C')        as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
             to_tsquery('simple', 'foo:*')                          as query) as subquery;

Here I only removed the '& something' part of the query. Now the query
behaves as one would expect: The first rank is higher than the second.
I am unsure why adding a second search term (which is contained in
both documents) would lead to a change in the ranking order.
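
To make the comparison easier to reproduce, the two queries and the two
documents can be put side by side in a single statement. The following is
only a sketch: the names doc_name, query_text, d and q are illustrative
and not part of the original report, and ts_rank_cd is included because it
was suggested above as an alternative. No expected output is shown; the
point is to see whether the ordering of doc1 and doc2 flips between the
two queries.

```sql
-- Sketch: rank both documents against both queries with ts_rank and ts_rank_cd.
-- doc_name, query_text, d and q are illustrative names, not from the report.
select d.doc_name,
       q.query_text,
       ts_rank(d.doc, to_tsquery('simple', q.query_text))    as rank_plain,
       ts_rank_cd(d.doc, to_tsquery('simple', q.query_text)) as rank_cd
from (values
        -- doc1: 'foo something' at weight A plus 'foobar' at weight C
        ('doc1', setweight(to_tsvector('simple', 'foo something'), 'A') ||
                 setweight(to_tsvector('simple', 'foobar'), 'C')),
        -- doc2: only 'foo something' at weight A
        ('doc2', setweight(to_tsvector('simple', 'foo something'), 'A'))
     ) as d(doc_name, doc)
cross join (values ('foo:*'), ('foo:* & something')) as q(query_text)
order by q.query_text, d.doc_name;
```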

What do you think?

Regards,
Dominik Giger
