Обсуждение: Rethinking the implementation of ts_headline()

Поиск

Список

Период

Сортировка

Rethinking the implementation of ts_headline()

От

Tom Lane

Дата:

25 ноября 2022 г., 22:52:15

After further contemplation of bug #17691 [1], I've concluded that
what I did in commit c9b0c678d was largely misguided.  For one
thing, the new hlCover() algorithm no longer finds shortest-possible
cover strings: if your query is "x & y" and the text is like
"... x ... x ... y ...", then the selected cover string will run
from the first occurrence of x to the y, whereas the old algorithm
would have correctly selected "x ... y".  For another thing, the
maximum-cover-length hack that I added in 78e73e875 to band-aid
over the performance issues of the original c9b0c678d patch means
that various scenarios no longer work as well as they used to,
which is the proximate cause of the complaints in bug #17691.

What I'm now thinking is that the original hlCover() algorithm was
fine (if underdocumented) *as long as it's only asked to deal with
AND-like semantics*.  Given that restriction, its approach of
finding the latest first occurrence of any query term, and then
backing up to the earliest last occurrence, is visibly correct;
and we weren't hearing of any performance issues with it either.
The problem is that this approach fails miserably for tsqueries
that aren't pure ANDs, because it will insist on including query
terms that aren't required for a match and indeed could prevent a
match.

So what we need is to find a way to fold the other sorts of queries
into an AND context.  After a couple of false starts, I came up
with the attached patch.  It builds on the existing TS_phrase_execute
infrastructure, which already produces an ExecPhraseData structure
containing an exact list of the match locations for a phrase subquery.
It's easy to get an ExecPhraseData structure for a single-lexeme
match, too.  We can handle plain ANDs by forming lists of
ExecPhraseData structs (with implicit AND semantics across the list).
We can handle ORs by union'ing the component ExecPhraseData structs.
And we can handle NOTs by just dropping them, because we don't want
ts_headline to prioritize matches to such words.  There are some fine
points but it all seems to work pretty well.

Hence I propose the attached patchset.  0001 adds some test cases that
I felt necessary after examining code coverage reports; I split this
out mainly so that 0002 clearly shows the behavioral changes from
current code.  0002 adds TS_execute_locations() which does what I just
described, and rewrites hlCover() into something that's a spiritual
descendant of the original algorithm.  It can't be quite the same
as before because the location data that TS_execute_locations() returns
is measured in lexeme indexes not word indexes, so some translation is
needed.

Notable here is that a couple of regression test results revert
to what they were before c9b0c678d.  They're both instances of
preferring "x ... x ... y" to "x ... y", which I argued was okay
at the time, but I now see the error in that.

Although we back-patched c9b0c678d, I'm inclined to propose this
for HEAD only.  The misbehaviors it's fixing are less bad than what
we were dealing with in that patch.

BTW, while experimenting with this I realized that tsvector's limitation
of lexeme indexes to be at most 16383 is really quite disastrous for
phrase searches.  That limitation was arguably okay before we had phrase
searching, but now it seems untenable.  For example, tsvector entries like
"foo:16383 bar:16383" will not match "foo <-> bar" because the phrase
match code wants their lexeme positions to differ by one.  This basically
means that phrase searches do not work beyond ~20K words into a document.
I'm not sure what to do about that exactly, but I think we need to do
something.

Whatever we might do about that would be largely orthogonal to this
patch, in any case.  I'll stick this in the January CF.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/17691-93cef39a14663963%40postgresql.org

diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index dc03f15499..e6a01cf985 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1804,6 +1804,123 @@ S. T. Coleridge (1772-1834)
  Water, water, every where
 (1 row)

+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'day & drink'));
+                        ts_headline
+-----------------------------------------------------------
+ <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,+
+   We stuck, nor breath nor motion,                       +
+ As idle as a painted Ship                                +
+   Upon a painted Ocean.                                  +
+ Water, water, every where                                +
+   And all the boards did shrink;                         +
+ Water, water, every
+(1 row)
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'day | drink'));
+                        ts_headline
+-----------------------------------------------------------
+ <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,+
+   We stuck, nor breath nor motion,                       +
+ As idle as a painted
+(1 row)
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'day | !drink'));
+                        ts_headline
+-----------------------------------------------------------
+ <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,+
+   We stuck, nor breath nor motion,                       +
+ As idle as a painted
+(1 row)
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'painted <-> Ship & drink'));
+           ts_headline
+----------------------------------
+ <b>painted</b> <b>Ship</b>      +
+   Upon a <b>painted</b> Ocean.  +
+ Water, water, every where       +
+   And all the boards did shrink;+
+ Water, water, every where,      +
+   Nor any drop to <b>drink</b>
+(1 row)
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'painted <-> Ship | drink'));
+           ts_headline
+---------------------------------
+ <b>painted</b> <b>Ship</b>     +
+   Upon a <b>painted</b> Ocean. +
+ Water, water, every where      +
+   And all the boards did shrink
+(1 row)
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'painted <-> Ship | !drink'));
+           ts_headline
+---------------------------------
+ <b>painted</b> <b>Ship</b>     +
+   Upon a <b>painted</b> Ocean. +
+ Water, water, every where      +
+   And all the boards did shrink
+(1 row)
+
 SELECT ts_headline('english', '
 Day after day, day after day,
   We stuck, nor breath nor motion,
@@ -1851,6 +1968,15 @@ to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
  <b>Lorem</b> ipsum <b>urna</b>.  Nullam nullam <b>ullamcorper</b> <b>urna</b>
 (1 row)

+SELECT ts_headline('english',
+'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
+phraseto_tsquery('english','ullamcorper urna'),
+'MaxWords=100, MinWords=5');
+                        ts_headline
+------------------------------------------------------------
+ <b>urna</b>.  Nullam nullam <b>ullamcorper</b> <b>urna</b>
+(1 row)
+
 SELECT ts_headline('english', '
 <html>
 <!-- some comment -->
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 0fa8ac4682..b56477a813 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -458,6 +458,78 @@ Water, water, every where,
 S. T. Coleridge (1772-1834)
 ', to_tsquery('english', 'ocean'));

+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'day & drink'));
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'day | drink'));
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'day | !drink'));
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'painted <-> Ship & drink'));
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'painted <-> Ship | drink'));
+
+SELECT ts_headline('english', '
+Day after day, day after day,
+  We stuck, nor breath nor motion,
+As idle as a painted Ship
+  Upon a painted Ocean.
+Water, water, every where
+  And all the boards did shrink;
+Water, water, every where,
+  Nor any drop to drink.
+S. T. Coleridge (1772-1834)
+', to_tsquery('english', 'painted <-> Ship | !drink'));
+
 SELECT ts_headline('english', '
 Day after day, day after day,
   We stuck, nor breath nor motion,
@@ -487,6 +559,11 @@ SELECT ts_headline('english',
 to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
 'MaxWords=100, MinWords=1');

+SELECT ts_headline('english',
+'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
+phraseto_tsquery('english','ullamcorper urna'),
+'MaxWords=100, MinWords=5');
+
 SELECT ts_headline('english', '
 <html>
 <!-- some comment -->
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 2323a3b908..d7f28c4524 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -1957,6 +1957,10 @@ typedef struct

 /*
  * TS_execute callback for matching a tsquery operand to headline words
+ *
+ * Note: it's tempting to report words[] indexes as pos values to save
+ * searching in hlCover; but that would screw up phrase matching, which
+ * expects to measure distances in lexemes not tokens.
  */
 static TSTernaryValue
 checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
@@ -1964,7 +1968,7 @@ checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
     hlCheck    *checkval = (hlCheck *) opaque;
     int            i;

-    /* scan words array for marching items */
+    /* scan words array for matching items */
     for (i = 0; i < checkval->len; i++)
     {
         if (checkval->words[i].item == val)
@@ -1993,35 +1997,15 @@ checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
     return TS_NO;
 }

-/*
- * hlFirstIndex: find first index >= pos containing any word used in query
- *
- * Returns -1 if no such index
- */
-static int
-hlFirstIndex(HeadlineParsedText *prs, int pos)
-{
-    int            i;
-
-    for (i = pos; i < prs->curwords; i++)
-    {
-        if (prs->words[i].item != NULL)
-            return i;
-    }
-    return -1;
-}
-
 /*
  * hlCover: try to find a substring of prs' word list that satisfies query
  *
- * At entry, *p must be the first word index to consider (initialize this
- * to zero, or to the next index after a previous successful search).
- * We will consider all substrings starting at or after that word, and
- * containing no more than max_cover words.  (We need a length limit to
- * keep this from taking O(N^2) time for a long document with many query
- * words but few complete matches.  Actually, since checkcondition_HL is
- * roughly O(N) in the length of the substring being checked, it's even
- * worse than that.)
+ * locations is the result of TS_execute_locations() for the query.
+ * We use this to identify plausible subranges of the query.
+ *
+ * *nextpos is the lexeme position (NOT word index) to start the search
+ * at.  Caller should initialize this to zero.  If successful, we'll
+ * advance it to the next place to search at.
  *
  * On success, sets *p to first word index and *q to last word index of the
  * cover substring, and returns true.
@@ -2030,57 +2014,149 @@ hlFirstIndex(HeadlineParsedText *prs, int pos)
  * words used in the query.
  */
 static bool
-hlCover(HeadlineParsedText *prs, TSQuery query, int max_cover,
-        int *p, int *q)
+hlCover(HeadlineParsedText *prs, TSQuery query, List *locations,
+        int *nextpos, int *p, int *q)
 {
-    int            pmin,
-                pmax,
-                nextpmin,
-                nextpmax;
-    hlCheck        ch;
+    int            pos = *nextpos;

-    /*
-     * We look for the earliest, shortest substring of prs->words that
-     * satisfies the query.  Both the pmin and pmax indices must be words
-     * appearing in the query; there's no point in trying endpoints in between
-     * such points.
-     */
-    pmin = hlFirstIndex(prs, *p);
-    while (pmin >= 0)
+    /* This loop repeats when our selected word-range fails the query */
+    for (;;)
     {
-        /* This useless assignment just keeps stupider compilers quiet */
-        nextpmin = -1;
-        /* Consider substrings starting at pmin */
-        ch.words = &(prs->words[pmin]);
-        /* Consider the length-one substring first, then longer substrings */
-        pmax = pmin;
-        do
+        int            posb,
+                    pose;
+        ListCell   *lc;
+
+        /*
+         * For each AND'ed query term or phrase, find its first occurrence at
+         * or after pos; set pose to the maximum of those positions.
+         *
+         * We need not consider ORs or NOTs here; see the comments for
+         * TS_execute_locations().  Rechecking the match with TS_execute(),
+         * below, will deal with any ensuing imprecision.
+         */
+        pose = -1;
+        foreach(lc, locations)
+        {
+            ExecPhraseData *pdata = (ExecPhraseData *) lfirst(lc);
+            int            first = -1;
+
+            for (int i = 0; i < pdata->npos; i++)
+            {
+                /* For phrase matches, use the ending lexeme */
+                int            endp = pdata->pos[i];
+
+                if (endp >= pos)
+                {
+                    first = endp;
+                    break;
+                }
+            }
+            if (first < 0)
+                return false;    /* no more matches for this term */
+            if (first > pose)
+                pose = first;
+        }
+
+        if (pose < 0)
+            return false;        /* we only get here if empty list */
+
+        /*
+         * Now, for each AND'ed query term or phrase, find its last occurrence
+         * at or before pose; set posb to the minimum of those positions.
+         *
+         * We start posb at INT_MAX - 1 to guarantee no overflow if we compute
+         * posb + 1 below.
+         */
+        posb = INT_MAX - 1;
+        foreach(lc, locations)
         {
-            /* Try to match query against pmin .. pmax substring */
-            ch.len = pmax - pmin + 1;
-            if (TS_execute(GETQUERY(query), &ch,
-                           TS_EXEC_EMPTY, checkcondition_HL))
+            ExecPhraseData *pdata = (ExecPhraseData *) lfirst(lc);
+            int            last = -1;
+
+            for (int i = pdata->npos - 1; i >= 0; i--)
             {
-                *p = pmin;
-                *q = pmax;
-                return true;
+                /* For phrase matches, use the starting lexeme */
+                int            startp = pdata->pos[i] - pdata->width;
+
+                if (startp <= pose)
+                {
+                    last = startp;
+                    break;
+                }
             }
-            /* Nope, so advance pmax to next feasible endpoint */
-            nextpmax = hlFirstIndex(prs, pmax + 1);
+            if (last < posb)
+                posb = last;
+        }
+
+        /*
+         * We could end up with posb to the left of pos, in case some phrase
+         * match crosses pos.  Try the match starting at pos anyway, since the
+         * result of TS_execute_locations is imprecise for phrase matches OR'd
+         * with plain matches; that is, if the query is "(A <-> B) | C" then C
+         * could match at pos even though the phrase match would have to
+         * extend to the left of pos.
+         */
+        posb = Max(posb, pos);

+        /* This test probably always succeeds, but be paranoid */
+        if (posb <= pose)
+        {
             /*
-             * If this is our first advance past pmin, then the result is also
-             * the next feasible value of pmin; remember it to save a
-             * redundant search.
+             * posb .. pose is now the shortest, earliest-after-pos range of
+             * lexeme positions containing all the query terms.  It will
+             * contain all phrase matches, too, except in the corner case
+             * described just above.
+             *
+             * Now convert these lexeme positions to indexes in prs->words[].
              */
-            if (pmax == pmin)
-                nextpmin = nextpmax;
-            pmax = nextpmax;
+            int            idxb = -1;
+            int            idxe = -1;
+
+            for (int i = 0; i < prs->curwords; i++)
+            {
+                if (prs->words[i].item == NULL)
+                    continue;
+                if (idxb < 0 && prs->words[i].pos >= posb)
+                    idxb = i;
+                if (prs->words[i].pos <= pose)
+                    idxe = i;
+                else
+                    break;
+            }
+
+            /* This test probably always succeeds, but be paranoid */
+            if (idxb >= 0 && idxe >= idxb)
+            {
+                /*
+                 * Finally, check that the selected range satisfies the query.
+                 * This should succeed in all simple cases; but odd cases
+                 * involving non-top-level NOT conditions or phrase matches
+                 * OR'd with other things could fail, since the result of
+                 * TS_execute_locations doesn't fully represent such things.
+                 */
+                hlCheck        ch;
+
+                ch.words = &(prs->words[idxb]);
+                ch.len = idxe - idxb + 1;
+                if (TS_execute(GETQUERY(query), &ch,
+                               TS_EXEC_EMPTY, checkcondition_HL))
+                {
+                    /* Match!  Advance *nextpos and return the word range. */
+                    *nextpos = posb + 1;
+                    *p = idxb;
+                    *q = idxe;
+                    return true;
+                }
+            }
         }
-        while (pmax >= 0 && pmax - pmin < max_cover);
-        /* No luck here, so try next feasible startpoint */
-        pmin = nextpmin;
+
+        /*
+         * Advance pos and try again.  Any later workable match must start
+         * beyond posb.
+         */
+        pos = posb + 1;
     }
+    /* Can't get here, but stupider compilers complain if we leave it off */
     return false;
 }

@@ -2177,9 +2253,10 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
  * it only controls presentation details.
  */
 static void
-mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall,
+mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, List *locations,
+                  bool highlightall,
                   int shortword, int min_words,
-                  int max_words, int max_fragments, int max_cover)
+                  int max_words, int max_fragments)
 {
     int32        poslen,
                 curlen,
@@ -2192,6 +2269,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall,

     int32        startpos = 0,
                 endpos = 0,
+                nextpos = 0,
                 p = 0,
                 q = 0;

@@ -2206,7 +2284,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall,
     covers = palloc(maxcovers * sizeof(CoverPos));

     /* get all covers */
-    while (hlCover(prs, query, max_cover, &p, &q))
+    while (hlCover(prs, query, locations, &nextpos, &p, &q))
     {
         startpos = p;
         endpos = q;
@@ -2236,9 +2314,6 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall,
             startpos = endpos + 1;
             endpos = q;
         }
-
-        /* move p to generate the next cover */
-        p++;
     }

     /* choose best covers */
@@ -2360,10 +2435,12 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall,
  * Headline selector used when MaxFragments == 0
  */
 static void
-mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall,
-              int shortword, int min_words, int max_words, int max_cover)
+mark_hl_words(HeadlineParsedText *prs, TSQuery query, List *locations,
+              bool highlightall,
+              int shortword, int min_words, int max_words)
 {
-    int            p = 0,
+    int            nextpos = 0,
+                p = 0,
                 q = 0;
     int            bestb = -1,
                 beste = -1;
@@ -2379,7 +2456,7 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall,
     if (!highlightall)
     {
         /* examine all covers, select a headline using the best one */
-        while (hlCover(prs, query, max_cover, &p, &q))
+        while (hlCover(prs, query, locations, &nextpos, &p, &q))
         {
             /*
              * Count words (curlen) and interesting words (poslen) within
@@ -2486,9 +2563,6 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall,
                 bestlen = poslen;
                 bestcover = poscover;
             }
-
-            /* move p to generate the next cover */
-            p++;
         }

         /*
@@ -2528,6 +2602,8 @@ prsd_headline(PG_FUNCTION_ARGS)
     HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
     List       *prsoptions = (List *) PG_GETARG_POINTER(1);
     TSQuery        query = PG_GETARG_TSQUERY(2);
+    hlCheck        ch;
+    List       *locations;

     /* default option values: */
     int            min_words = 15;
@@ -2535,7 +2611,6 @@ prsd_headline(PG_FUNCTION_ARGS)
     int            shortword = 3;
     int            max_fragments = 0;
     bool        highlightall = false;
-    int            max_cover;
     ListCell   *l;

     /* Extract configuration option values */
@@ -2575,15 +2650,6 @@ prsd_headline(PG_FUNCTION_ARGS)
                             defel->defname)));
     }

-    /*
-     * We might eventually make max_cover a user-settable parameter, but for
-     * now, just compute a reasonable value based on max_words and
-     * max_fragments.
-     */
-    max_cover = Max(max_words * 10, 100);
-    if (max_fragments > 0)
-        max_cover *= max_fragments;
-
     /* in HighlightAll mode these parameters are ignored */
     if (!highlightall)
     {
@@ -2605,13 +2671,19 @@ prsd_headline(PG_FUNCTION_ARGS)
                      errmsg("MaxFragments should be >= 0")));
     }

+    /* Locate words and phrases matching the query */
+    ch.words = prs->words;
+    ch.len = prs->curwords;
+    locations = TS_execute_locations(GETQUERY(query), &ch, TS_EXEC_EMPTY,
+                                     checkcondition_HL);
+
     /* Apply appropriate headline selector */
     if (max_fragments == 0)
-        mark_hl_words(prs, query, highlightall, shortword,
-                      min_words, max_words, max_cover);
+        mark_hl_words(prs, query, locations, highlightall, shortword,
+                      min_words, max_words);
     else
-        mark_hl_fragments(prs, query, highlightall, shortword,
-                          min_words, max_words, max_fragments, max_cover);
+        mark_hl_fragments(prs, query, locations, highlightall, shortword,
+                          min_words, max_words, max_fragments);

     /* Fill in default values for string options */
     if (!prs->startsel)
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index f7c1e3d6d6..cd970bcae2 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -71,6 +71,10 @@ typedef struct
 static TSTernaryValue TS_execute_recurse(QueryItem *curitem, void *arg,
                                          uint32 flags,
                                          TSExecuteCallback chkcond);
+static bool TS_execute_locations_recurse(QueryItem *curitem,
+                                         void *arg,
+                                         TSExecuteCallback chkcond,
+                                         List **locations);
 static int    tsvector_bsearch(const TSVector tsv, char *lexeme, int lexeme_len);
 static Datum tsvector_update_trigger(PG_FUNCTION_ARGS, bool config_column);

@@ -1569,10 +1573,10 @@ TS_phrase_output(ExecPhraseData *data,
  * In addition to the same arguments used for TS_execute, the caller may pass
  * a preinitialized-to-zeroes ExecPhraseData struct, to be filled with lexeme
  * match position info on success.  data == NULL if no position data need be
- * returned.  (In practice, outside callers pass NULL, and only the internal
- * recursion cases pass a data pointer.)
+ * returned.
  * Note: the function assumes data != NULL for operators other than OP_PHRASE.
- * This is OK because an outside call always starts from an OP_PHRASE node.
+ * This is OK because an outside call always starts from an OP_PHRASE node,
+ * and all internal recursion cases pass data != NULL.
  *
  * The detailed semantics of the match data, given that the function returned
  * TS_YES (successful match), are:
@@ -1969,6 +1973,177 @@ TS_execute_recurse(QueryItem *curitem, void *arg, uint32 flags,
     return TS_NO;
 }

+/*
+ * Evaluate tsquery and report locations of matching terms.
+ *
+ * This is like TS_execute except that it returns match locations not just
+ * success/failure status.  The callback function is required to provide
+ * position data (we report failure if it doesn't).
+ *
+ * On successful match, the result is a List of ExecPhraseData structs, one
+ * for each AND'ed term or phrase operator in the query.  Each struct includes
+ * a sorted array of lexeme positions matching that term.  (Recall that for
+ * phrase operators, the match includes width+1 lexemes, and the recorded
+ * position is that of the rightmost lexeme.)
+ *
+ * OR subexpressions are handled by union'ing their match locations into a
+ * single List element, which is valid since any of those locations contains
+ * a match.  However, when some of the OR'ed terms are phrase operators, we
+ * report the maximum width of any of the OR'ed terms, making such cases
+ * slightly imprecise in the conservative direction.  (For example, if the
+ * tsquery is "(A <-> B) | C", an occurrence of C in the data would be
+ * reported as though it includes the lexeme to the left of C.)
+ *
+ * Locations of NOT subexpressions are not reported.  (Obviously, there can
+ * be no successful NOT matches at top level, or the match would have failed.
+ * So this amounts to ignoring NOTs underneath ORs.)
+ *
+ * The result is NIL if no match, or if position data was not returned.
+ *
+ * Arguments are the same as for TS_execute, although flags is currently
+ * vestigial since none of the defined bits are sensible here.
+ */
+List *
+TS_execute_locations(QueryItem *curitem, void *arg,
+                     uint32 flags,
+                     TSExecuteCallback chkcond)
+{
+    List       *result;
+
+    /* No flags supported, as yet */
+    Assert(flags == TS_EXEC_EMPTY);
+    if (TS_execute_locations_recurse(curitem, arg, chkcond, &result))
+        return result;
+    return NIL;
+}
+
+/*
+ * TS_execute_locations recursion for operators above any phrase operator.
+ * OP_PHRASE subexpressions can be passed off to TS_phrase_execute.
+ */
+static bool
+TS_execute_locations_recurse(QueryItem *curitem, void *arg,
+                             TSExecuteCallback chkcond,
+                             List **locations)
+{
+    bool        lmatch,
+                rmatch;
+    List       *llocations,
+               *rlocations;
+    ExecPhraseData *data;
+
+    /* since this function recurses, it could be driven to stack overflow */
+    check_stack_depth();
+
+    /* ... and let's check for query cancel while we're at it */
+    CHECK_FOR_INTERRUPTS();
+
+    /* Default locations result is empty */
+    *locations = NIL;
+
+    if (curitem->type == QI_VAL)
+    {
+        data = palloc0_object(ExecPhraseData);
+        if (chkcond(arg, (QueryOperand *) curitem, data) == TS_YES)
+        {
+            *locations = list_make1(data);
+            return true;
+        }
+        pfree(data);
+        return false;
+    }
+
+    switch (curitem->qoperator.oper)
+    {
+        case OP_NOT:
+            if (!TS_execute_locations_recurse(curitem + 1, arg, chkcond,
+                                              &llocations))
+                return true;    /* we don't pass back any locations */
+            return false;
+
+        case OP_AND:
+            if (!TS_execute_locations_recurse(curitem + curitem->qoperator.left,
+                                              arg, chkcond,
+                                              &llocations))
+                return false;
+            if (!TS_execute_locations_recurse(curitem + 1,
+                                              arg, chkcond,
+                                              &rlocations))
+                return false;
+            *locations = list_concat(llocations, rlocations);
+            return true;
+
+        case OP_OR:
+            lmatch = TS_execute_locations_recurse(curitem + curitem->qoperator.left,
+                                                  arg, chkcond,
+                                                  &llocations);
+            rmatch = TS_execute_locations_recurse(curitem + 1,
+                                                  arg, chkcond,
+                                                  &rlocations);
+            if (lmatch || rmatch)
+            {
+                /*
+                 * We generate an AND'able location struct from each
+                 * combination of sub-matches, following the disjunctive law
+                 * (A & B) | (C & D) = (A | C) & (A | D) & (B | C) & (B | D).
+                 *
+                 * However, if either input didn't produce locations (i.e., it
+                 * failed or was a NOT), we must just return the other list.
+                 */
+                if (llocations == NIL)
+                    *locations = rlocations;
+                else if (rlocations == NIL)
+                    *locations = llocations;
+                else
+                {
+                    ListCell   *ll;
+
+                    foreach(ll, llocations)
+                    {
+                        ExecPhraseData *ldata = (ExecPhraseData *) lfirst(ll);
+                        ListCell   *lr;
+
+                        foreach(lr, rlocations)
+                        {
+                            ExecPhraseData *rdata = (ExecPhraseData *) lfirst(lr);
+
+                            data = palloc0_object(ExecPhraseData);
+                            (void) TS_phrase_output(data, ldata, rdata,
+                                                    TSPO_BOTH | TSPO_L_ONLY | TSPO_R_ONLY,
+                                                    0, 0,
+                                                    ldata->npos + rdata->npos);
+                            /* Report the larger width, as explained above. */
+                            data->width = Max(ldata->width, rdata->width);
+                            *locations = lappend(*locations, data);
+                        }
+                    }
+                }
+
+                return true;
+            }
+            return false;
+
+        case OP_PHRASE:
+            /* We can hand this off to TS_phrase_execute */
+            data = palloc0_object(ExecPhraseData);
+            if (TS_phrase_execute(curitem, arg, TS_EXEC_EMPTY, chkcond,
+                                  data) == TS_YES)
+            {
+                if (!data->negate)
+                    *locations = list_make1(data);
+                return true;
+            }
+            pfree(data);
+            return false;
+
+        default:
+            elog(ERROR, "unrecognized operator: %d", curitem->qoperator.oper);
+    }
+
+    /* not reachable, but keep compiler quiet */
+    return false;
+}
+
 /*
  * Detect whether a tsquery boolean expression requires any positive matches
  * to values shown in the tsquery.
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 6fdd334fff..efefac6874 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -202,6 +202,9 @@ extern bool TS_execute(QueryItem *curitem, void *arg, uint32 flags,
 extern TSTernaryValue TS_execute_ternary(QueryItem *curitem, void *arg,
                                          uint32 flags,
                                          TSExecuteCallback chkcond);
+extern List *TS_execute_locations(QueryItem *curitem, void *arg,
+                                  uint32 flags,
+                                  TSExecuteCallback chkcond);
 extern bool tsquery_requires_match(QueryItem *curitem);

 /*
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index e6a01cf985..0e68245743 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1815,15 +1815,16 @@ Water, water, every where,
   Nor any drop to drink.
 S. T. Coleridge (1772-1834)
 ', to_tsquery('english', 'day & drink'));
-                        ts_headline
------------------------------------------------------------
- <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,+
-   We stuck, nor breath nor motion,                       +
- As idle as a painted Ship                                +
-   Upon a painted Ocean.                                  +
- Water, water, every where                                +
-   And all the boards did shrink;                         +
- Water, water, every
+            ts_headline
+------------------------------------
+ <b>day</b>,                       +
+   We stuck, nor breath nor motion,+
+ As idle as a painted Ship         +
+   Upon a painted Ocean.           +
+ Water, water, every where         +
+   And all the boards did shrink;  +
+ Water, water, every where,        +
+   Nor any drop
 (1 row)

 SELECT ts_headline('english', '
@@ -1932,12 +1933,12 @@ Water, water, every where,
   Nor any drop to drink.
 S. T. Coleridge (1772-1834)
 ', phraseto_tsquery('english', 'painted Ocean'));
-              ts_headline
----------------------------------------
- <b>painted</b> Ship                  +
-   Upon a <b>painted</b> <b>Ocean</b>.+
- Water, water, every where            +
-   And all the boards did shrink
+           ts_headline
+----------------------------------
+ <b>painted</b> <b>Ocean</b>.    +
+ Water, water, every where       +
+   And all the boards did shrink;+
+ Water, water, every
 (1 row)

 SELECT ts_headline('english', '
@@ -1972,9 +1973,9 @@ SELECT ts_headline('english',
 'Lorem ipsum urna.  Nullam nullam ullamcorper urna.',
 phraseto_tsquery('english','ullamcorper urna'),
 'MaxWords=100, MinWords=5');
-                        ts_headline
-------------------------------------------------------------
- <b>urna</b>.  Nullam nullam <b>ullamcorper</b> <b>urna</b>
+                         ts_headline
+-------------------------------------------------------------
+ <b>urna</b>.  Nullam nullam <b>ullamcorper</b> <b>urna</b>.
 (1 row)

 SELECT ts_headline('english', '
@@ -2019,9 +2020,9 @@ SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 & 3', 'MaxWords=4, MinWords=1
 (1 row)

 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=4, MinWords=1');
-        ts_headline
-----------------------------
- <b>3</b> <b>1</b> <b>3</b>
+    ts_headline
+-------------------
+ <b>1</b> <b>3</b>
 (1 row)

 --Check if headline fragments work

Re: Rethinking the implementation of ts_headline()

От

Alvaro Herrera

Дата:

16 января 2023 г., 15:23:03

On 2022-Nov-25, Tom Lane wrote:

> After further contemplation of bug #17691 [1], I've concluded that
> what I did in commit c9b0c678d was largely misguided.  For one
> thing, the new hlCover() algorithm no longer finds shortest-possible
> cover strings: if your query is "x & y" and the text is like
> "... x ... x ... y ...", then the selected cover string will run
> from the first occurrence of x to the y, whereas the old algorithm
> would have correctly selected "x ... y".  For another thing, the
> maximum-cover-length hack that I added in 78e73e875 to band-aid
> over the performance issues of the original c9b0c678d patch means
> that various scenarios no longer work as well as they used to,
> which is the proximate cause of the complaints in bug #17691.

I came across #17556 which contains a different test for this, and I'm
not sure that this patch changes things completely for the better.  In
that bug report, Alex Malek presents this example

select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || 'labor',
           $$'ipsum' & 'labor'$$::tsquery,
       'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3'),
    ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || 'labor',
           $$'ipsum' & 'labor'$$::tsquery,
       'StartSel={, StopSel=}, MaxFragments=100, MaxWords=7, MinWords=3');

which returns, in the current HEAD, the following
     ts_headline     │ ts_headline 
─────────────────────┼─────────────
 {ipsum} ... {labor} │ baz baz baz
(1 fila)

That is, once past the 5000 words of distance, it fails to find a good
cover, but before that it returns an acceptable headline.  However,
after your proposed patch, we get this:

 ts_headline │ ts_headline 
─────────────┼─────────────
 {ipsum}     │ {ipsum}
(1 fila)

which is an improvement in the second case, though perhaps not as much
as we would like, and definitely not an improvement in the first case.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."                   (New York Times, about Microsoft PowerPoint)

Re: Rethinking the implementation of ts_headline()

От

Tom Lane

Дата:

16 января 2023 г., 19:28:52

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I came across #17556 which contains a different test for this, and I'm
> not sure that this patch changes things completely for the better.

Thanks for looking at my patch.  However ...

> That is, once past the 5000 words of distance, it fails to find a good
> cover, but before that it returns an acceptable headline.  However,
> after your proposed patch, we get this:

>  ts_headline │ ts_headline
> ─────────────┼─────────────
>  {ipsum}     │ {ipsum}
> (1 fila)

I get this with the patch:

     ts_headline     |     ts_headline
---------------------+---------------------
 {ipsum} ... {labor} | {ipsum} ... {labor}
(1 row)

which is what I'd expect, because it removes the artificial limit on
cover length that I added in 78e73e875.  So I'm wondering why you got a
different result.  Maybe something to do with locale?  I tried it in
C and en_US.utf8.

            regards, tom lane

Re: Rethinking the implementation of ts_headline()

От

Alvaro Herrera

Дата:

17 января 2023 г., 11:32:11

On 2023-Jan-16, Tom Lane wrote:

> I get this with the patch:
> 
>      ts_headline     |     ts_headline     
> ---------------------+---------------------
>  {ipsum} ... {labor} | {ipsum} ... {labor}
> (1 row)
> 
> which is what I'd expect, because it removes the artificial limit on
> cover length that I added in 78e73e875.  So I'm wondering why you got a
> different result.

Indeed, that's what I get now, too, after re-applying the patches.
I find no way to reproduce the bogus result I got yesterday.

Sorry for the noise.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801

Re: Rethinking the implementation of ts_headline()

От

Alvaro Herrera

Дата:

18 января 2023 г., 14:09:42

I tried this other test, based on looking at the new regression tests
you added,

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
               ts_headline               
─────────────────────────────────────────
 motion,                                ↵
 As <b>idle</b> as a <b>painted</b> Ship↵
   Upon
(1 fila)

and was surprised that the match for the 'day & drink' arm of the OR
disappears from the reported headline.

This is what 15 reports for the same query:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
                        ts_headline                        
───────────────────────────────────────────────────────────
 <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
   We stuck ... motion,                                   ↵
 As <b>idle</b> as a <b>painted</b> Ship                  ↵
   Upon
(1 fila)

I think this was better.

15 seems to fail in other ways; for instance, 'drink' is not highlighted in the
headline when the OR matches, but if the other arm of the OR doesn't match, it
is; for example both 15 and master return the same for this one:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (mountain & backpack)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
                        ts_headline                        
───────────────────────────────────────────────────────────
 <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
   We stuck ... drop to <b>drink</b>.                     ↵
 S. T. Coleridge
(1 fila)



Another thing I think might be a regression is the way fragments are
selected.  Consider what happens if I change the "idle & painted" in the
earlier query to "idle <-> painted", and MaxWords is kept low:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
                  ts_headline                  
───────────────────────────────────────────────
 <b>day</b>,                                  ↵
   We stuck, nor breath nor motion,           ↵
 As <b>idle</b> ... <b>painted</b> Ship       ↵
   Upon a <b>painted</b> Ocean.               ↵
 Water, water, every ... drop to <b>drink</b>.↵
 S. T. Coleridge
(1 fila)

Note that it chose to put a fragment delimiter exactly in the middle of the
phrase match, where the stop words are.  If I raise MaxWords, it is of course
much better, I suppose because the word limit doesn't force a new fragment,

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=25, MinWords=4');
                   ts_headline                    
──────────────────────────────────────────────────
 after <b>day</b>, <b>day</b> after <b>day</b>,  ↵
   We stuck, nor breath nor motion,              ↵
 As <b>idle</b> as a <b>painted</b> Ship         ↵
   Upon a <b>painted</b> Ocean.                  ↵
 Water, water, every where ... boards did shrink;↵
 Water, water, every where,                      ↵
   Nor any drop to <b>drink</b>.                 ↵
 S. T. Coleridge
(1 fila)

But in 15, the query with low MaxWords does this instead, where the
fragment delimiter occurs just *before* the phrasal match.

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', '(day & drink) | (idle <-> painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
                        ts_headline                        
───────────────────────────────────────────────────────────
 <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
   We stuck ... <b>idle</b> as a <b>painted</b> Ship      ↵
   Upon a <b>painted</b> Ocean ... drop to <b>drink</b>.  ↵
 S. T. Coleridge
(1 fila)

(Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
verse, which perhaps they shouldn't do, since it's not preceded by
'idle'.)


(I think it's super annoying that the fragment separation algorithm
fails to preserve newlines between verses as it adds the '...'
separator.  But I guess poetry is not the main use case for text search
anyway, so it probably doesn't matter much.)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801

Re: Rethinking the implementation of ts_headline()

От

Tom Lane

Дата:

19 января 2023 г., 01:55:46

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I tried this other test, based on looking at the new regression tests
> you added,

> SELECT ts_headline('english', '
> Day after day, day after day,
>   We stuck, nor breath nor motion,
> As idle as a painted Ship
>   Upon a painted Ocean.
> Water, water, every where
>   And all the boards did shrink;
> Water, water, every where,
>   Nor any drop to drink.
> S. T. Coleridge (1772-1834)
> ', to_tsquery('english', '(day & drink) | (idle & painted)'), 'MaxFragments=5, MaxWords=9, MinWords=4');
>                ts_headline
> ─────────────────────────────────────────
>  motion,                                ↵
>  As <b>idle</b> as a <b>painted</b> Ship↵
>    Upon
> (1 fila)

> and was surprised that the match for the 'day & drink' arm of the OR
> disappears from the reported headline.

I'd argue that that's exactly what should happen.  It's supposed to
find as-short-as-possible cover strings that satisfy the query.
In this case, satisfying the 'day & drink' condition would require
nearly the entire input text, whereas 'idle & painted' can be
satisfied in just the third line.  So what you get is the third line,
slightly expanded because of some later rules that like to add
context if the cover is shorter than MaxWords.  I don't find anything
particularly principled about the old behavior:

>  <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
>    We stuck ... motion,                                   ↵
>  As <b>idle</b> as a <b>painted</b> Ship                  ↵
>    Upon

It's including hits for "day" into the cover despite the lack of any
nearby match to "drink".

> Another thing I think might be a regression is the way fragments are
> selected.  Consider what happens if I change the "idle & painted" in the
> earlier query to "idle <-> painted", and MaxWords is kept low:

Of course, "idle <-> painted" is satisfied nowhere in this text
(the words are there, but not adjacent).  So the cover has to
run from the last 'day' to the 'drink'.  I think v15 is deciding
that it runs from the first 'day' to the 'drink', which while not
exactly wrong is not the shortest cover.  The rest of this is just
minor variations in what mark_hl_fragments() decides to do based
on the precise cover string it's given.  I don't dispute that
mark_hl_fragments() could be improved, but this patch doesn't touch
its algorithm and I think that doing so should be material for a
different patch.  (I have no immediate ideas about what would be
a better algorithm for it, anyway.)

> (Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
> verse, which perhaps they shouldn't do, since it's not preceded by
> 'idle'.)

Yeah, and 'idle' too.  Once it's chosen a string to show, it'll
highlight all the query words within that string, whether they
constitute part of the match or not.  I can see arguments on both
sides of doing it that way; it was probably more sensible before
we had phrase match than it is now.  But again, changing that phase
of the processing is outside the scope of this patch.  I'm just
trying to undo the damage I did to the cover-string-selection phase.

            regards, tom lane

Re: Rethinking the implementation of ts_headline()

От

Alvaro Herrera

Дата:

19 января 2023 г., 13:16:29

On 2023-Jan-18, Tom Lane wrote:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> > and was surprised that the match for the 'day & drink' arm of the OR
> > disappears from the reported headline.
> 
> I'd argue that that's exactly what should happen.  It's supposed to
> find as-short-as-possible cover strings that satisfy the query.

OK, that makes sense.

> I don't find anything particularly principled about the old behavior:
> 
> >  <b>Day</b> after <b>day</b>, <b>day</b> after <b>day</b>,↵
> >    We stuck ... motion,                                   ↵
> >  As <b>idle</b> as a <b>painted</b> Ship                  ↵
> >    Upon
> 
> It's including hits for "day" into the cover despite the lack of any
> nearby match to "drink".

I suppose it would be possible to put 'day' and 'drink' in two different
fragments: since the query has a & operator for them, the words don't
necessarily have to be nearby.  But OK, your argument for this being the
shortest result is sensible.

I do wonder, though, if it's effectively usable for somebody building a
search interface on top.  If I'm ranking the results from several
documents, and this document comes on top of others that only match one
arm of the OR query, then I would like to be able to show the matches
for both arms of the OR.

> > Another thing I think might be a regression is the way fragments are
> > selected.  Consider what happens if I change the "idle & painted" in the
> > earlier query to "idle <-> painted", and MaxWords is kept low:
> 
> Of course, "idle <-> painted" is satisfied nowhere in this text
> (the words are there, but not adjacent).

Oh, I see the problem, and it is my misunderstanding: the <-> operator
is counting the words in between, even if they are stop words.  I
understood from the docs that those words were ignored, but that is not
so.  I misread the phraseto_tsquery doc as though they were explaining
the <-> operator.

Another experiment shows that the headline becomes "complete" only if I
specify the exact distance in the <N> operator:

SELECT dist, ts_headline('simple', 'As idle as a painted Ship', to_tsquery('simple', format('(idle <%s> painted)',
dist)),'MaxFragments=5, MaxWords=8, MinWords=4') from generate_series(1, 4) dist;

 dist │             ts_headline              
──────┼──────────────────────────────────────
    1 │ As <b>idle</b> as a
    2 │ As <b>idle</b> as a
    3 │ <b>idle</b> as a <b>painted</b> Ship
    4 │ As <b>idle</b> as a
(4 filas)

I again have to question how valuable in practice is a <N> operator
that's so strict that I have to know exactly how many stop words I want
there to be in between the phrase search.  For some reason, in my mind I
had it as "at most N words, ignoring stop words", but that's not what it
is.

Anyway, I don't think this needs to stop your current patch.

> So the cover has to run from the last 'day' to the 'drink'.  I think
> v15 is deciding that it runs from the first 'day' to the 'drink',
> which while not exactly wrong is not the shortest cover.

Sounds fair.

> The rest of this is just minor variations in what mark_hl_fragments()
> decides to do based on the precise cover string it's given.  I don't
> dispute that mark_hl_fragments() could be improved, but this patch
> doesn't touch its algorithm and I think that doing so should be
> material for a different patch.

Understood and agreed.

> > (Both 15 and master highlight 'painted' in the "Upon a painted Ocean"
> > verse, which perhaps they shouldn't do, since it's not preceded by
> > 'idle'.)
> 
> Yeah, and 'idle' too.  Once it's chosen a string to show, it'll
> highlight all the query words within that string, whether they
> constitute part of the match or not.  I can see arguments on both
> sides of doing it that way; it was probably more sensible before
> we had phrase match than it is now.  But again, changing that phase
> of the processing is outside the scope of this patch.  I'm just
> trying to undo the damage I did to the cover-string-selection phase.

All clear then.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)

Re: Rethinking the implementation of ts_headline()

От

Tom Lane

Дата:

19 января 2023 г., 19:13:30

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2023-Jan-18, Tom Lane wrote:
>> It's including hits for "day" into the cover despite the lack of any
>> nearby match to "drink".

> I suppose it would be possible to put 'day' and 'drink' in two different
> fragments: since the query has a & operator for them, the words don't
> necessarily have to be nearby.  But OK, your argument for this being the
> shortest result is sensible.

> I do wonder, though, if it's effectively usable for somebody building a
> search interface on top.  If I'm ranking the results from several
> documents, and this document comes on top of others that only match one
> arm of the OR query, then I would like to be able to show the matches
> for both arms of the OR.

The fundamental problem with the case you're posing is that MaxWords
is too small to allow the 'day & drink' match to be shown as a whole.
If you make MaxWords large enough then you do find it including
(and highlighting) 'drink', but I'm not sure we should stress out
about what happens otherwise.

> Oh, I see the problem, and it is my misunderstanding: the <-> operator
> is counting the words in between, even if they are stop words.

Yeah.  AFAICS this is a very deliberate, longstanding decision,
so I'm hesitant to change it.  Your test case with 'simple'
proves little, because there are no stop words in 'simple':

regression=# select to_tsvector('simple', 'As idle as a painted Ship');
                 to_tsvector                  
----------------------------------------------
 'a':4 'as':1,3 'idle':2 'painted':5 'ship':6
(1 row)

However, when I switch to 'english':

regression=# select to_tsvector('english', 'As idle as a painted Ship');
        to_tsvector         
----------------------------
 'idl':2 'paint':5 'ship':6
(1 row)

the stop words are gone, but the recorded positions remain the same.
So this is really a matter of how to_tsvector chooses to count word
positions, it's not the fault of the <-> construct in particular.

I'm not convinced that this particular behavior is wrong, anyway.
The user of text search isn't supposed to have to think about
which words are stopwords or not, so I think that it's entirely
sensible for 'idle as a painted' to match 'idle <3> painted'.
Maybe the docs need some adjustment?  But in any case that's
material for a different thread.

> I again have to question how valuable in practice is a <N> operator
> that's so strict that I have to know exactly how many stop words I want
> there to be in between the phrase search.  For some reason, in my mind I
> had it as "at most N words, ignoring stop words", but that's not what it
> is.

Yeah, I recall discussing "up to N words" semantics for this in the
past, but nobody has made that happen.

> Anyway, I don't think this needs to stop your current patch.

Many thanks for looking at it!

            regards, tom lane

Re: Rethinking the implementation of ts_headline()

От

Alexander Lakhin

Дата:

06 апреля 2023 г., 20:00:00

Hi,

19.01.2023 19:13, Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Anyway, I don't think this needs to stop your current patch.

Many thanks for looking at it!

I've found that starting from commit 5a617d75 this query:
SELECT ts_headline('english', 'To be, or not to be', to_tsquery('english', 'or'));

invokes a valgrind-detected error:
==00:00:00:03.950 3241424== Invalid read of size 1
==00:00:00:03.950 3241424==    at 0x8EE2A9: TS_execute_locations_recurse (tsvector_op.c:2046)
==00:00:00:03.950 3241424==    by 0x8EE220: TS_execute_locations (tsvector_op.c:2017)
==00:00:00:03.950 3241424==    by 0x77EAC4: prsd_headline (wparser_def.c:2677)
==00:00:00:03.950 3241424==    by 0x94536C: FunctionCall3Coll (fmgr.c:1173)
==00:00:00:03.950 3241424==    by 0x778648: ts_headline_byid_opt (wparser.c:322)
==00:00:00:03.950 3241424==    by 0x9440F5: DirectFunctionCall3Coll (fmgr.c:836)
==00:00:00:03.950 3241424==    by 0x778763: ts_headline_byid (wparser.c:343)
==00:00:00:03.950 3241424==    by 0x4AC9F2: ExecInterpExpr (execExprInterp.c:752)
==00:00:00:03.950 3241424==    by 0x4AEDFE: ExecInterpExprStillValid (execExprInterp.c:1838)
==00:00:00:03.950 3241424==    by 0x636A7E: ExecEvalExprSwitchContext (executor.h:344)
==00:00:00:03.950 3241424==    by 0x63E92D: evaluate_expr (clauses.c:4843)
==00:00:00:03.950 3241424==    by 0x63DA53: evaluate_function (clauses.c:4345)
...
(Initially I had encountered an asan-detected heap-buffer-overflow with a
more informative document.)

But the less-verbose call:
SELECT ts_headline('', '');

discovers a different error even on 5a617d75~1:
==00:00:00:04.113 3139158== Conditional jump or move depends on uninitialised value(s)
==00:00:00:04.113 3139158==    at 0x77B44F: mark_fragment (wparser_def.c:2100)
==00:00:00:04.113 3139158==    by 0x77E2F2: mark_hl_words (wparser_def.c:2519)
==00:00:00:04.113 3139158==    by 0x77E891: prsd_headline (wparser_def.c:2610)
==00:00:00:04.113 3139158==    by 0x944B68: FunctionCall3Coll (fmgr.c:1173)
==00:00:00:04.113 3139158==    by 0x778648: ts_headline_byid_opt (wparser.c:322)
==00:00:00:04.113 3139158==    by 0x9438F1: DirectFunctionCall3Coll (fmgr.c:836)
==00:00:00:04.113 3139158==    by 0x7787B6: ts_headline (wparser.c:352)
==00:00:00:04.113 3139158==    by 0x4AC9F2: ExecInterpExpr (execExprInterp.c:752)
==00:00:00:04.113 3139158==    by 0x4AEDFE: ExecInterpExprStillValid (execExprInterp.c:1838)
==00:00:00:04.113 3139158==    by 0x50BD0C: ExecEvalExprSwitchContext (executor.h:344)
==00:00:00:04.113 3139158==    by 0x50BD84: ExecProject (executor.h:378)
==00:00:00:04.113 3139158==    by 0x50BFBB: ExecResult (nodeResult.c:136)
==00:00:00:04.113 3139158==

I've reproduced it on REL9_4_STABLE (REL9_4_15) and it looks like the defect
in mark_hl_words():
        int                     bestb = -1,
                                beste = -1;
        int                     bestlen = -1;
        int                     pose = 0,
...
        if (highlight == 0)
        {
                while (hlCover(prs, query, &p, &q))
                {
...
                if (bestlen < 0)
                {
                        curlen = 0;
                        for (i = 0; i < prs->curwords && curlen < min_words; i++)
                        {
                                if (!NONWORDTOKEN(prs->words[i].type))
                                        curlen++;
                                pose = i;
                        }
                        bestb = 0;
                        beste = pose;
                }
...
// here we have bestb == 0, beste == 0, but no prs->words in this case
        for (i = bestb; i <= beste; i++)
        {
                if (prs->words[i].item)
                        prs->words[i].selected = 1;

exists since 2a0083ede.
(Sorry for the distraction.)

Best regards,
Alexander

Re: Rethinking the implementation of ts_headline()

От

Tom Lane

Дата:

06 апреля 2023 г., 22:55:58

Alexander Lakhin <exclusion@gmail.com> writes:
> I've found that starting from commit 5a617d75 this query:
> SELECT ts_headline('english', 'To be, or not to be', to_tsquery('english', 'or'));
> invokes a valgrind-detected error:
> ==00:00:00:03.950 3241424== Invalid read of size 1

On my machine, I also see PG-reported errors such as "unrecognized
operator: 0".  It's a live bug all right.  We need to be more careful
about empty tsqueries.

> But the less-verbose call:
> SELECT ts_headline('', '');

> discovers a different error even on 5a617d75~1:
> ==00:00:00:04.113 3139158== Conditional jump or move depends on uninitialised value(s)
> ==00:00:00:04.113 3139158==    at 0x77B44F: mark_fragment (wparser_def.c:2100)

Yeah, this one seems to be ancient sloppiness.  I don't think it has
any bad effect beyond annoying valgrind, but I fixed it anyway.

Thanks for the report!

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Rethinking the implementation of ts_headline()