Re: Rethinking the implementation of ts_headline()
От | Alvaro Herrera |
---|---|
Тема | Re: Rethinking the implementation of ts_headline() |
Дата | |
Msg-id | 20230119101629.vjdjktbol273zjpm@alvherre.pgsql обсуждение исходный текст |
Ответ на | Re: Rethinking the implementation of ts_headline() (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: Rethinking the implementation of ts_headline()
(Tom Lane <tgl@sss.pgh.pa.us>)
|
Список | pgsql-hackers |
On 2023-Jan-18, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > and was surprised that the match for the 'day & drink' arm of the OR > > disappears from the reported headline. > > I'd argue that that's exactly what should happen. It's supposed to > find as-short-as-possible cover strings that satisfy the query. OK, that makes sense. > I don't find anything particularly principled about the old behavior: > > > <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵ > > We stuck ... motion, ↵ > > As <b>idle</b> as a <b>painted</b> Ship ↵ > > Upon > > It's including hits for "day" into the cover despite the lack of any > nearby match to "drink". I suppose it would be possible to put 'day' and 'drink' in two different fragments: since the query has a & operator for them, the words don't necessarily have to be nearby. But OK, your argument for this being the shortest result is sensible. I do wonder, though, if it's effectively usable for somebody building a search interface on top. If I'm ranking the results from several documents, and this document comes on top of others that only match one arm of the OR query, then I would like to be able to show the matches for both arms of the OR. > > Another thing I think might be a regression is the way fragments are > > selected. Consider what happens if I change the "idle & painted" in the > > earlier query to "idle <-> painted", and MaxWords is kept low: > > Of course, "idle <-> painted" is satisfied nowhere in this text > (the words are there, but not adjacent). Oh, I see the problem, and it is my misunderstanding: the <-> operator is counting the words in between, even if they are stop words. I understood from the docs that those words were ignored, but that is not so. I misread the phraseto_tsquery doc as though they were explaining the <-> operator. Another experiment shows that the headline becomes "complete" only if I specify the exact distance in the <N> operator: SELECT dist, ts_headline('simple', 'As idle as a painted Ship', to_tsquery('simple', format('(idle <%s> painted)', dist)),'MaxFragments=5, MaxWords=8, MinWords=4') from generate_series(1, 4) dist; dist │ ts_headline ──────┼────────────────────────────────────── 1 │ As <b>idle</b> as a 2 │ As <b>idle</b> as a 3 │ <b>idle</b> as a <b>painted</b> Ship 4 │ As <b>idle</b> as a (4 filas) I again have to question how valuable in practice is a <N> operator that's so strict that I have to know exactly how many stop words I want there to be in between the phrase search. For some reason, in my mind I had it as "at most N words, ignoring stop words", but that's not what it is. Anyway, I don't think this needs to stop your current patch. > So the cover has to run from the last 'day' to the 'drink'. I think > v15 is deciding that it runs from the first 'day' to the 'drink', > which while not exactly wrong is not the shortest cover. Sounds fair. > The rest of this is just minor variations in what mark_hl_fragments() > decides to do based on the precise cover string it's given. I don't > dispute that mark_hl_fragments() could be improved, but this patch > doesn't touch its algorithm and I think that doing so should be > material for a different patch. Understood and agreed. > > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean" > > verse, which perhaps they shouldn't do, since it's not preceded by > > 'idle'.) > > Yeah, and 'idle' too. Once it's chosen a string to show, it'll > highlight all the query words within that string, whether they > constitute part of the match or not. I can see arguments on both > sides of doing it that way; it was probably more sensible before > we had phrase match than it is now. But again, changing that phase > of the processing is outside the scope of this patch. I'm just > trying to undo the damage I did to the cover-string-selection phase. All clear then. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "The eagle never lost so much time, as when he submitted to learn of the crow." (William Blake)
В списке pgsql-hackers по дате отправления:
Предыдущее
От: shveta malikДата:
Сообщение: Re: Perform streaming logical transactions by background workers and parallel apply
Следующее
От: shveta malikДата:
Сообщение: Re: Perform streaming logical transactions by background workers and parallel apply