Re: Rethinking the implementation of ts_headline()

Поиск
Список
Период
Сортировка
От Alvaro Herrera
Тема Re: Rethinking the implementation of ts_headline()
Дата
Msg-id 20230119101629.vjdjktbol273zjpm@alvherre.pgsql
обсуждение исходный текст
Ответ на Re: Rethinking the implementation of ts_headline()  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Rethinking the implementation of ts_headline()  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On 2023-Jan-18, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> > and was surprised that the match for the 'day & drink' arm of the OR
> > disappears from the reported headline.
> 
> I'd argue that that's exactly what should happen.  It's supposed to
> find as-short-as-possible cover strings that satisfy the query.

OK, that makes sense.

> I don't find anything particularly principled about the old behavior:
> 
> >  <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
> >    We stuck ... motion,                                   ↵
> >  As <b>idle</b> as a <b>painted</b> Ship                  ↵
> >    Upon
> 
> It's including hits for "day" into the cover despite the lack of any
> nearby match to "drink".

I suppose it would be possible to put 'day' and 'drink' in two different
fragments: since the query has a & operator for them, the words don't
necessarily have to be nearby.  But OK, your argument for this being the
shortest result is sensible.

I do wonder, though, if it's effectively usable for somebody building a
search interface on top.  If I'm ranking the results from several
documents, and this document comes on top of others that only match one
arm of the OR query, then I would like to be able to show the matches
for both arms of the OR.

> > Another thing I think might be a regression is the way fragments are
> > selected.  Consider what happens if I change the "idle & painted" in the
> > earlier query to "idle <-> painted", and MaxWords is kept low:
> 
> Of course, "idle <-> painted" is satisfied nowhere in this text
> (the words are there, but not adjacent).

Oh, I see the problem, and it is my misunderstanding: the <-> operator
is counting the words in between, even if they are stop words.  I
understood from the docs that those words were ignored, but that is not
so.  I misread the phraseto_tsquery doc as though they were explaining
the <-> operator.

Another experiment shows that the headline becomes "complete" only if I
specify the exact distance in the <N> operator:

SELECT dist, ts_headline('simple', 'As idle as a painted Ship', to_tsquery('simple', format('(idle <%s> painted)',
dist)),'MaxFragments=5, MaxWords=8, MinWords=4') from generate_series(1, 4) dist;
 
 dist │             ts_headline              
──────┼──────────────────────────────────────
    1 │ As <b>idle</b> as a
    2 │ As <b>idle</b> as a
    3 │ <b>idle</b> as a <b>painted</b> Ship
    4 │ As <b>idle</b> as a
(4 filas)

I again have to question how valuable in practice is a <N> operator
that's so strict that I have to know exactly how many stop words I want
there to be in between the phrase search.  For some reason, in my mind I
had it as "at most N words, ignoring stop words", but that's not what it
is.

Anyway, I don't think this needs to stop your current patch.

> So the cover has to run from the last 'day' to the 'drink'.  I think
> v15 is deciding that it runs from the first 'day' to the 'drink',
> which while not exactly wrong is not the shortest cover.

Sounds fair.

> The rest of this is just minor variations in what mark_hl_fragments()
> decides to do based on the precise cover string it's given.  I don't
> dispute that mark_hl_fragments() could be improved, but this patch
> doesn't touch its algorithm and I think that doing so should be
> material for a different patch.

Understood and agreed.

> > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
> > verse, which perhaps they shouldn't do, since it's not preceded by
> > 'idle'.)
> 
> Yeah, and 'idle' too.  Once it's chosen a string to show, it'll
> highlight all the query words within that string, whether they
> constitute part of the match or not.  I can see arguments on both
> sides of doing it that way; it was probably more sensible before
> we had phrase match than it is now.  But again, changing that phase
> of the processing is outside the scope of this patch.  I'm just
> trying to undo the damage I did to the cover-string-selection phase.

All clear then.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)



В списке pgsql-hackers по дате отправления:

Предыдущее
От: shveta malik
Дата:
Сообщение: Re: Perform streaming logical transactions by background workers and parallel apply
Следующее
От: shveta malik
Дата:
Сообщение: Re: Perform streaming logical transactions by background workers and parallel apply