Thread: Some regular-expression performance hacking

From: Tom Lane

Date: 11 February 2021, 04:39:43

As I mentioned in connection with adding the src/test/modules/test_regex
test code, I've been fooling with some performance improvements to our
regular expression engine.  Here are the first fruits of that labor.
This is mostly concerned with cutting the overhead for handling trivial
unconstrained patterns like ".*".

0001 creates the concept of a "rainbow" arc within regex NFAs.  You can
read background info about this in the "Colors and colormapping" part of
regex/README, but the basic point is that right now, representing a dot
(".", match anything) within an NFA requires a separate arc for each
"color" (character equivalence class) that the regex needs.  This uses
up a fair amount of storage and processing effort, especially in larger
regexes which tend to have a lot of colors.  We can replace such a
"rainbow" of arcs with a single arc labeled with a special color
RAINBOW.  This is worth doing on its own account, just because it saves
space and time.  For example, on the reg-33.15.1 test case in
test_regex.sql (a moderately large real-world RE), I find that HEAD
requires 1377614 bytes to represent the compiled RE, and the peak space
usage during pg_regcomp() is 3124376 bytes.  With this patch, that drops
to 1077166 bytes for the RE (21% savings) with peak compilation space
2800752 bytes (10% savings).  Moreover, the runtime for that test case
drops from ~57ms to ~44ms, a 22% savings.  (This is mostly measuring the
RE compilation time.  Execution time should drop a bit too since miss()
need consider fewer arcs; but that savings is in a cold code path so it
won't matter much.)  These aren't earth-shattering numbers of course,
but for the amount of code needed, it seems well worth while.
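
To make the arc-count difference concrete, here is a minimal toy sketch; it
is not code from the patch.  ToyArc and the toy_* functions are hypothetical
names, and the model ignores pseudocolors, which the real RAINBOW does not
match.  Only the value -2 for the special color is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: not the PostgreSQL structs. */
#define TOY_RAINBOW (-2)        /* same value the patch assigns to RAINBOW */

typedef struct
{
    int         co;             /* color number, or TOY_RAINBOW */
    int         to;             /* destination state number */
} ToyArc;

/* Old scheme: represent "." with one arc per live color. */
static size_t
toy_dot_per_color(ToyArc *out, size_t cap, int ncolors, int to)
{
    size_t      n = 0;

    assert(out != NULL);
    for (int co = 0; co < ncolors && n < cap; co++)
    {
        out[n].co = co;
        out[n].to = to;
        n++;
    }
    return n;                   /* number of arcs emitted */
}

/* New scheme: represent "." with a single arc labeled TOY_RAINBOW. */
static size_t
toy_dot_rainbow(ToyArc *out, size_t cap, int to)
{
    assert(out != NULL);
    if (cap == 0)
        return 0;
    out[0].co = TOY_RAINBOW;
    out[0].to = to;
    return 1;
}

/* Does this arc accept a character of color "co"?  (No pseudocolors here.) */
static int
toy_arc_matches(const ToyArc *a, int co)
{
    return a->co == co || a->co == TOY_RAINBOW;
}
```

With 40 colors, toy_dot_per_color() emits 40 arcs where toy_dot_rainbow()
emits one, and the single arc still accepts every color.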

A possible point of contention is that I exposed the idea of a rainbow
arc in the regexport.h APIs, which will force consumers of that API
to adapt --- see the changes to contrib/pg_trgm for an example.  I'm
not too concerned about this because I kinda suspect that pg_trgm is
the only consumer of that API anywhere.  (codesearch.debian.net knows
of no others, anyway.)  We could in principle hide the change by
having the regexport functions expand a rainbow arc into one for
each color, but that seems like make-work.  pg_trgm would certainly
not see it as an improvement, and in general users of that API should
appreciate recognizing rainbows as such, since they might be able to
apply optimizations that depend on doing so.

Which brings us to 0002, which is exactly such an optimization.
The idea here is to short-circuit character-by-character scanning
when matching a sub-NFA that is like "." or ".*" or variants of
that, ie it will match any sequence of some number of characters.
This requires the ability to recognize that a particular pair of
NFA states are linked by a rainbow, so it's a lot less painful
to do when rainbows are represented explicitly.  The example that
got me interested in this is adapted from a Tcl trouble report:

select array_dims(regexp_matches(repeat('x',40) || '=' || repeat('y',50000),
                                 '^(.*)=(.*)$'));

On my machine, this takes about 6 seconds in HEAD, because there's an
O(N^2) effect: we try to match the sub-NFA for the first "(.*)" capture
group to each possible starting string, and only after expensively
verifying that tautological match do we check to see if the next
character is "=".  By not having to do any per-character work to decide
that .* matches a substring, the O(N^2) behavior is removed and the time
drops to about 7 msec.
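
The quadratic effect can be seen in a small cost model.  This is only a
sketch under stated assumptions, not the engine's algorithm: it assumes
candidate prefixes for the first "(.*)" are tried longest-first, charges one
step per character to verify a prefix without the short-circuit, and charges
one step (a length check) with it.  toy_find_split() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the longest prefix s[0..len) that is followed by '='.
 * *steps receives the modeled cost of deciding that each candidate
 * prefix matches ".*".  Returns the prefix length, or -1 if no '='.
 */
static long
toy_find_split(const char *s, size_t n, int shortcircuit,
               unsigned long long *steps)
{
    assert(s != NULL && steps != NULL);
    *steps = 0;

    /* Try candidate prefix lengths n, n-1, ..., 0. */
    for (size_t len = n + 1; len-- > 0;)
    {
        if (shortcircuit)
            *steps += 1;        /* only a length-bounds check */
        else
            *steps += len;      /* one step per character scanned */

        /* Only now look at the character after the prefix. */
        if (len < n && s[len] == '=')
            return (long) len;
    }
    return -1;
}
```

For the input "xx=yyy" the model charges 6+5+4+3+2 = 20 steps without the
short-circuit and 5 steps with it; the gap grows quadratically with the
number of characters after the "=".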

(One could also imagine fixing this by rearranging things to check for
the "=" match before verifying the capture-group matches.  That's an
idea I hope to look into in future, because it could help for cases
where the variable parts are not merely ".*".  But I don't have clear
ideas about how to do that, and in any case ".*" is common enough that
the present change should still be helpful.)

There are two non-boilerplate parts of the 0002 patch.  One is the
checkmatchall() function that determines whether an NFA is match-all,
and if so what the min and max match lengths are.  This is actually not
very complicated once you understand what the regex engine does at the
"pre" and "post" states.  (See the "Detailed semantics" part of
regex/README for some info about that, which I tried to clarify as part
of the patch.)  Other than those endpoint conditions it's just a
recursive graph search.  The other, trickier part is the changes in
rege_dfa.c to provide the actual short-circuit behavior at runtime.
That's ticklish because it's trying to emulate some overly complicated
and under-documented code, particularly in longest() and shortest().
I think that stuff is right; I've studied it and tested it.  But it
could use more eyeballs.

Notably, I had to add some more test cases to test_regex.sql to exercise
the short-circuit part of matchuntil() properly.  That's only used for
lookbehind constraints, so we won't hit the short-circuit path except
with something like '(?<=..)', which is maybe a tad silly.

I'll add this to the upcoming commitfest.

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 1e4f0121f3..fcf03de32d 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -282,8 +282,8 @@ typedef struct
 typedef int TrgmColor;

 /* We assume that colors returned by the regexp engine cannot be these: */
-#define COLOR_UNKNOWN    (-1)
-#define COLOR_BLANK        (-2)
+#define COLOR_UNKNOWN    (-3)
+#define COLOR_BLANK        (-4)

 typedef struct
 {
@@ -780,7 +780,8 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
         palloc0(colorsCount * sizeof(TrgmColorInfo));

     /*
-     * Loop over colors, filling TrgmColorInfo about each.
+     * Loop over colors, filling TrgmColorInfo about each.  Note we include
+     * WHITE (0) even though we know it'll be reported as non-expandable.
      */
     for (i = 0; i < colorsCount; i++)
     {
@@ -1098,9 +1099,9 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
             /* Add enter key to this state */
             addKeyToQueue(trgmNFA, &destKey);
         }
-        else
+        else if (arc->co >= 0)
         {
-            /* Regular color */
+            /* Regular color (including WHITE) */
             TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[arc->co];

             if (colorInfo->expandable)
@@ -1156,6 +1157,14 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
                 addKeyToQueue(trgmNFA, &destKey);
             }
         }
+        else
+        {
+            /* RAINBOW: treat as unexpandable color */
+            destKey.prefix.colors[0] = COLOR_UNKNOWN;
+            destKey.prefix.colors[1] = COLOR_UNKNOWN;
+            destKey.nstate = arc->to;
+            addKeyToQueue(trgmNFA, &destKey);
+        }
     }

     pfree(arcs);
@@ -1216,10 +1225,10 @@ addArcs(TrgmNFA *trgmNFA, TrgmState *state)
             /*
              * Ignore non-expandable colors; addKey already handled the case.
              *
-             * We need no special check for begin/end pseudocolors here.  We
-             * don't need to do any processing for them, and they will be
-             * marked non-expandable since the regex engine will have reported
-             * them that way.
+             * We need no special check for WHITE or begin/end pseudocolors
+             * here.  We don't need to do any processing for them, and they
+             * will be marked non-expandable since the regex engine will have
+             * reported them that way.
              */
             if (!colorInfo->expandable)
                 continue;
diff --git a/src/backend/regex/README b/src/backend/regex/README
index f08aab69e3..cc1834b89c 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -261,6 +261,18 @@ and the NFA has these arcs:
     states 4 -> 5 on color 2 ("x" only)
 which can be seen to be a correct representation of the regex.

+There is one more complexity, which is how to handle ".", that is a
+match-anything atom.  We used to do that by generating a "rainbow"
+of arcs of all live colors between the two NFA states before and after
+the dot.  That's expensive in itself when there are lots of colors,
+and it also typically adds lots of follow-on arc-splitting work for the
+color splitting logic.  Now we handle this case by generating a single arc
+labeled with the special color RAINBOW, meaning all colors.  Such arcs
+never need to be split, so they help keep NFAs small in this common case.
+(Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
+not supposed to match newline.  In that case we still handle "." by
+generating an almost-rainbow of all colors except newline's color.)
+
 Given this summary, we can see we need the following operations for
 colors:

@@ -349,6 +361,8 @@ The possible arc types are:

     PLAIN arcs, which specify matching of any character of a given "color"
     (see above).  These are dumped as "[color_number]->to_state".
+    In addition there can be "rainbow" PLAIN arcs, which are dumped as
+    "[*]->to_state".

     EMPTY arcs, which specify a no-op transition to another state.  These
     are dumped as "->to_state".
@@ -356,11 +370,11 @@ The possible arc types are:
     AHEAD constraints, which represent a "next character must be of this
     color" constraint.  AHEAD differs from a PLAIN arc in that the input
     character is not consumed when crossing the arc.  These are dumped as
-    ">color_number>->to_state".
+    ">color_number>->to_state", or possibly ">*>->to_state".

     BEHIND constraints, which represent a "previous character must be of
     this color" constraint, which likewise consumes no input.  These are
-    dumped as "<color_number<->to_state".
+    dumped as "<color_number<->to_state", or possibly "<*<->to_state".

     '^' arcs, which specify a beginning-of-input constraint.  These are
     dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index f5a4151757..0864011cce 100644
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -977,6 +977,7 @@ colorchain(struct colormap *cm,
 {
     struct colordesc *cd = &cm->cd[a->co];

+    assert(a->co >= 0);
     if (cd->arcs != NULL)
         cd->arcs->colorchainRev = a;
     a->colorchain = cd->arcs;
@@ -994,6 +995,7 @@ uncolorchain(struct colormap *cm,
     struct colordesc *cd = &cm->cd[a->co];
     struct arc *aa = a->colorchainRev;

+    assert(a->co >= 0);
     if (aa == NULL)
     {
         assert(cd->arcs == a);
@@ -1012,6 +1014,9 @@ uncolorchain(struct colormap *cm,

 /*
  * rainbow - add arcs of all full colors (but one) between specified states
+ *
+ * If there isn't an exception color, we now generate just a single arc
+ * labeled RAINBOW, saving lots of arc-munging later on.
  */
 static void
 rainbow(struct nfa *nfa,
@@ -1025,6 +1030,13 @@ rainbow(struct nfa *nfa,
     struct colordesc *end = CDEND(cm);
     color        co;

+    if (but == COLORLESS)
+    {
+        newarc(nfa, type, RAINBOW, from, to);
+        return;
+    }
+
+    /* Gotta do it the hard way.  Skip subcolors, pseudocolors, and "but" */
     for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
         if (!UNUSEDCOLOR(cd) && cd->sub != co && co != but &&
             !(cd->flags & PSEUDO))
@@ -1034,13 +1046,16 @@ rainbow(struct nfa *nfa,
 /*
  * colorcomplement - add arcs of complementary colors
  *
+ * We add arcs of all colors that are not pseudocolors and do not match
+ * any of the "of" state's PLAIN outarcs.
+ *
  * The calling sequence ought to be reconciled with cloneouts().
  */
 static void
 colorcomplement(struct nfa *nfa,
                 struct colormap *cm,
                 int type,
-                struct state *of,    /* complements of this guy's PLAIN outarcs */
+                struct state *of,
                 struct state *from,
                 struct state *to)
 {
@@ -1049,6 +1064,11 @@ colorcomplement(struct nfa *nfa,
     color        co;

     assert(of != from);
+
+    /* A RAINBOW arc matches all colors, making the complement empty */
+    if (findarc(of, PLAIN, RAINBOW) != NULL)
+        return;
+
     for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
         if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
             if (findarc(of, PLAIN, co) == NULL)
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index 92c9c4d795..1ac030570d 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -271,6 +271,11 @@ destroystate(struct nfa *nfa,
  *
  * This function checks to make sure that no duplicate arcs are created.
  * In general we never want duplicates.
+ *
+ * However: in principle, a RAINBOW arc is redundant with any plain arc
+ * (unless that arc is for a pseudocolor).  But we don't try to recognize
+ * that redundancy, either here or in allied operations such as moveins().
+ * The pseudocolor consideration makes that more costly than it seems worth.
  */
 static void
 newarc(struct nfa *nfa,
@@ -1170,6 +1175,9 @@ copyouts(struct nfa *nfa,

 /*
  * cloneouts - copy out arcs of a state to another state pair, modifying type
+ *
+ * This is only used to convert PLAIN arcs to AHEAD/BEHIND arcs, which share
+ * the same interpretation of "co".  It wouldn't be sensible with LACONs.
  */
 static void
 cloneouts(struct nfa *nfa,
@@ -1181,9 +1189,13 @@ cloneouts(struct nfa *nfa,
     struct arc *a;

     assert(old != from);
+    assert(type == AHEAD || type == BEHIND);

     for (a = old->outs; a != NULL; a = a->outchain)
+    {
+        assert(a->type == PLAIN);
         newarc(nfa, type, a->co, from, to);
+    }
 }

 /*
@@ -1597,7 +1609,7 @@ pull(struct nfa *nfa,
     for (a = from->ins; a != NULL && !NISERR(); a = nexta)
     {
         nexta = a->inchain;
-        switch (combine(con, a))
+        switch (combine(nfa, con, a))
         {
             case INCOMPATIBLE:    /* destroy the arc */
                 freearc(nfa, a);
@@ -1624,6 +1636,10 @@ pull(struct nfa *nfa,
                 cparc(nfa, a, s, to);
                 freearc(nfa, a);
                 break;
+            case REPLACEARC:    /* replace arc's color */
+                newarc(nfa, a->type, con->co, a->from, to);
+                freearc(nfa, a);
+                break;
             default:
                 assert(NOTREACHED);
                 break;
@@ -1764,7 +1780,7 @@ push(struct nfa *nfa,
     for (a = to->outs; a != NULL && !NISERR(); a = nexta)
     {
         nexta = a->outchain;
-        switch (combine(con, a))
+        switch (combine(nfa, con, a))
         {
             case INCOMPATIBLE:    /* destroy the arc */
                 freearc(nfa, a);
@@ -1791,6 +1807,10 @@ push(struct nfa *nfa,
                 cparc(nfa, a, from, s);
                 freearc(nfa, a);
                 break;
+            case REPLACEARC:    /* replace arc's color */
+                newarc(nfa, a->type, con->co, from, a->to);
+                freearc(nfa, a);
+                break;
             default:
                 assert(NOTREACHED);
                 break;
@@ -1810,9 +1830,11 @@ push(struct nfa *nfa,
  * #def INCOMPATIBLE    1    // destroys arc
  * #def SATISFIED        2    // constraint satisfied
  * #def COMPATIBLE        3    // compatible but not satisfied yet
+ * #def REPLACEARC        4    // replace arc's color with constraint color
  */
 static int
-combine(struct arc *con,
+combine(struct nfa *nfa,
+        struct arc *con,
         struct arc *a)
 {
 #define  CA(ct,at)     (((ct)<<CHAR_BIT) | (at))
@@ -1827,14 +1849,46 @@ combine(struct arc *con,
         case CA(BEHIND, PLAIN):
             if (con->co == a->co)
                 return SATISFIED;
+            if (con->co == RAINBOW)
+            {
+                /* con is satisfied unless arc's color is a pseudocolor */
+                if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+                    return SATISFIED;
+            }
+            else if (a->co == RAINBOW)
+            {
+                /* con is incompatible if it's for a pseudocolor */
+                if (nfa->cm->cd[con->co].flags & PSEUDO)
+                    return INCOMPATIBLE;
+                /* otherwise, constraint constrains arc to be only its color */
+                return REPLACEARC;
+            }
             return INCOMPATIBLE;
             break;
         case CA('^', '^'):        /* collision, similar constraints */
         case CA('$', '$'):
-        case CA(AHEAD, AHEAD):
+            if (con->co == a->co)    /* true duplication */
+                return SATISFIED;
+            return INCOMPATIBLE;
+            break;
+        case CA(AHEAD, AHEAD):    /* collision, similar constraints */
         case CA(BEHIND, BEHIND):
             if (con->co == a->co)    /* true duplication */
                 return SATISFIED;
+            if (con->co == RAINBOW)
+            {
+                /* con is satisfied unless arc's color is a pseudocolor */
+                if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+                    return SATISFIED;
+            }
+            else if (a->co == RAINBOW)
+            {
+                /* con is incompatible if it's for a pseudocolor */
+                if (nfa->cm->cd[con->co].flags & PSEUDO)
+                    return INCOMPATIBLE;
+                /* otherwise, constraint constrains arc to be only its color */
+                return REPLACEARC;
+            }
             return INCOMPATIBLE;
             break;
         case CA('^', BEHIND):    /* collision, dissimilar constraints */
@@ -2895,6 +2949,7 @@ compact(struct nfa *nfa,
                     break;
                 case LACON:
                     assert(s->no != cnfa->pre);
+                    assert(a->co >= 0);
                     ca->co = (color) (cnfa->ncolors + a->co);
                     ca->to = a->to->no;
                     ca++;
@@ -2902,7 +2957,7 @@ compact(struct nfa *nfa,
                     break;
                 default:
                     NERR(REG_ASSERT);
-                    break;
+                    return;
             }
         carcsort(first, ca - first);
         ca->co = COLORLESS;
@@ -3068,13 +3123,22 @@ dumparc(struct arc *a,
     switch (a->type)
     {
         case PLAIN:
-            fprintf(f, "[%ld]", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, "[*]");
+            else
+                fprintf(f, "[%ld]", (long) a->co);
             break;
         case AHEAD:
-            fprintf(f, ">%ld>", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, ">*>");
+            else
+                fprintf(f, ">%ld>", (long) a->co);
             break;
         case BEHIND:
-            fprintf(f, "<%ld<", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, "<*<");
+            else
+                fprintf(f, "<%ld<", (long) a->co);
             break;
         case LACON:
             fprintf(f, ":%ld:", (long) a->co);
@@ -3161,7 +3225,9 @@ dumpcstate(int st,
     pos = 1;
     for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++)
     {
-        if (ca->co < cnfa->ncolors)
+        if (ca->co == RAINBOW)
+            fprintf(f, "\t[*]->%d", ca->to);
+        else if (ca->co < cnfa->ncolors)
             fprintf(f, "\t[%ld]->%d", (long) ca->co, ca->to);
         else
             fprintf(f, "\t:%ld:->%d", (long) (ca->co - cnfa->ncolors), ca->to);
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 91078dcd80..5956b86026 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -158,7 +158,8 @@ static int    push(struct nfa *, struct arc *, struct state **);
 #define INCOMPATIBLE    1        /* destroys arc */
 #define SATISFIED    2            /* constraint satisfied */
 #define COMPATIBLE    3            /* compatible but not satisfied yet */
-static int    combine(struct arc *, struct arc *);
+#define REPLACEARC    4            /* replace arc's color with constraint color */
+static int    combine(struct nfa *nfa, struct arc *con, struct arc *a);
 static void fixempties(struct nfa *, FILE *);
 static struct state *emptyreachable(struct nfa *, struct state *,
                                     struct state *, struct arc **);
@@ -289,9 +290,11 @@ struct vars
 #define SBEGIN    'A'                /* beginning of string (even if not BOL) */
 #define SEND    'Z'                /* end of string (even if not EOL) */

-/* is an arc colored, and hence on a color chain? */
+/* is an arc colored, and hence should belong to a color chain? */
+/* the test on "co" eliminates RAINBOW arcs, which we don't bother to chain */
 #define COLORED(a) \
-    ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND)
+    ((a)->co >= 0 && \
+     ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND))


 /* static function list */
@@ -1390,7 +1393,8 @@ bracket(struct vars *v,
  * cbracket - handle complemented bracket expression
  * We do it by calling bracket() with dummy endpoints, and then complementing
  * the result.  The alternative would be to invoke rainbow(), and then delete
- * arcs as the b.e. is seen... but that gets messy.
+ * arcs as the b.e. is seen... but that gets messy, and is really quite
+ * infeasible now that rainbow() just puts out one RAINBOW arc.
  */
 static void
 cbracket(struct vars *v,
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 5695e158a5..32be2592c5 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -612,6 +612,7 @@ miss(struct vars *v,
     unsigned    h;
     struct carc *ca;
     struct sset *p;
+    int            ispseudocolor;
     int            ispost;
     int            noprogress;
     int            gotstate;
@@ -643,13 +644,15 @@ miss(struct vars *v,
      */
     for (i = 0; i < d->wordsper; i++)
         d->work[i] = 0;            /* build new stateset bitmap in d->work */
+    ispseudocolor = d->cm->cd[co].flags & PSEUDO;
     ispost = 0;
     noprogress = 1;
     gotstate = 0;
     for (i = 0; i < d->nstates; i++)
         if (ISBSET(css->states, i))
             for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)
-                if (ca->co == co)
+                if (ca->co == co ||
+                    (ca->co == RAINBOW && !ispseudocolor))
                 {
                     BSET(d->work, ca->to);
                     gotstate = 1;
diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c
index d4f940b8c3..a493dbe88c 100644
--- a/src/backend/regex/regexport.c
+++ b/src/backend/regex/regexport.c
@@ -222,7 +222,8 @@ pg_reg_colorisend(const regex_t *regex, int co)
  * Get number of member chrs of color number "co".
  *
  * Note: we return -1 if the color number is invalid, or if it is a special
- * color (WHITE or a pseudocolor), or if the number of members is uncertain.
+ * color (WHITE, RAINBOW, or a pseudocolor), or if the number of members is
+ * uncertain.
  * Callers should not try to extract the members if -1 is returned.
  */
 int
@@ -233,7 +234,7 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
     assert(regex != NULL && regex->re_magic == REMAGIC);
     cm = &((struct guts *) regex->re_guts)->cmap;

-    if (co <= 0 || co > cm->max)    /* we reject 0 which is WHITE */
+    if (co <= 0 || co > cm->max)    /* <= 0 rejects WHITE and RAINBOW */
         return -1;
     if (cm->cd[co].flags & PSEUDO)    /* also pseudocolors (BOS etc) */
         return -1;
@@ -257,7 +258,7 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
  * whose length chars_len must be at least as long as indicated by
  * pg_reg_getnumcharacters(), else not all chars will be returned.
  *
- * Fetching the members of WHITE or a pseudocolor is not supported.
+ * Fetching the members of WHITE, RAINBOW, or a pseudocolor is not supported.
  *
  * Caution: this is a relatively expensive operation.
  */
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index 1d4593ac94..e2fbad7a8a 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -165,9 +165,13 @@ findprefix(struct cnfa *cnfa,
             /* We can ignore BOS/BOL arcs */
             if (ca->co == cnfa->bos[0] || ca->co == cnfa->bos[1])
                 continue;
-            /* ... but EOS/EOL arcs terminate the search, as do LACONs */
+
+            /*
+             * ... but EOS/EOL arcs terminate the search, as do RAINBOW arcs
+             * and LACONs
+             */
             if (ca->co == cnfa->eos[0] || ca->co == cnfa->eos[1] ||
-                ca->co >= cnfa->ncolors)
+                ca->co == RAINBOW || ca->co >= cnfa->ncolors)
             {
                 thiscolor = COLORLESS;
                 break;
diff --git a/src/include/regex/regexport.h b/src/include/regex/regexport.h
index e6209463f7..99c4fb854e 100644
--- a/src/include/regex/regexport.h
+++ b/src/include/regex/regexport.h
@@ -30,6 +30,10 @@

 #include "regex/regex.h"

+/* These macros must match corresponding ones in regguts.h: */
+#define COLOR_WHITE        0        /* color for chars not appearing in regex */
+#define COLOR_RAINBOW    (-2)    /* represents all colors except pseudocolors */
+
 /* information about one arc of a regex's NFA */
 typedef struct
 {
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 5d0e7a961c..5bcd669d59 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -130,11 +130,16 @@
 /*
  * As soon as possible, we map chrs into equivalence classes -- "colors" --
  * which are of much more manageable number.
+ *
+ * To further reduce the number of arcs in NFAs and DFAs, we also have a
+ * special RAINBOW "color" that can be assigned to an arc.  This is not a
+ * real color, in that it has no entry in color maps.
  */
 typedef short color;            /* colors of characters */

 #define MAX_COLOR    32767        /* max color (must fit in 'color' datatype) */
 #define COLORLESS    (-1)        /* impossible color */
+#define RAINBOW        (-2)        /* represents all colors except pseudocolors */
 #define WHITE        0            /* default color, parent of all others */
 /* Note: various places in the code know that WHITE is zero */

@@ -276,7 +281,7 @@ struct state;
 struct arc
 {
     int            type;            /* 0 if free, else an NFA arc type code */
-    color        co;
+    color        co;                /* color the arc matches (possibly RAINBOW) */
     struct state *from;            /* where it's from (and contained within) */
     struct state *to;            /* where it's to */
     struct arc *outchain;        /* link in *from's outs chain or free chain */
@@ -284,6 +289,7 @@ struct arc
 #define  freechain    outchain    /* we do not maintain "freechainRev" */
     struct arc *inchain;        /* link in *to's ins chain */
     struct arc *inchainRev;        /* back-link in *to's ins chain */
+    /* these fields are not used when co == RAINBOW: */
     struct arc *colorchain;        /* link in color's arc chain */
     struct arc *colorchainRev;    /* back-link in color's arc chain */
 };
@@ -344,6 +350,9 @@ struct nfa
  * Plain arcs just store the transition color number as "co".  LACON arcs
  * store the lookaround constraint number plus cnfa.ncolors as "co".  LACON
  * arcs can be distinguished from plain by testing for co >= cnfa.ncolors.
+ *
+ * Note that in a plain arc, "co" can be RAINBOW; since that's negative,
+ * it doesn't break the rule about how to recognize LACON arcs.
  */
 struct carc
 {
diff --git a/src/backend/regex/README b/src/backend/regex/README
index cc1834b89c..a83ab5074d 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -410,14 +410,20 @@ substring, or an imaginary following EOS character if the substring is at
 the end of the input.
 3. If the NFA is (or can be) in the goal state at this point, it matches.

+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any.  The executor implements that by checking if the
+adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
+right color, and it does that in the same loop that checks characters
+within the match.
+
 So one can mentally execute an untransformed NFA by taking ^ and $ as
 ordinary constraints that match at start and end of input; but plain
 arcs out of the start state should be taken as matches for the character
 before the target substring, and similarly, plain arcs leading to the
 post state are matches for the character after the target substring.
-This definition is necessary to support regexes that begin or end with
-constraints such as \m and \M, which imply requirements on the adjacent
-character if any.  NFAs for simple unanchored patterns will usually have
-pre-state outarcs for all possible character colors as well as BOS and
-BOL, and post-state inarcs for all possible character colors as well as
-EOS and EOL, so that the executor's behavior will work.
+After the optimize() transformation, there are explicit arcs mentioning
+BOS/BOL/EOS/EOL adjacent to the pre-state and post-state.  So a finished
+NFA for a pattern without anchors or adjacent-character constraints will
+have pre-state outarcs for RAINBOW (all possible character colors) as well
+as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index 1ac030570d..3ebcd9855c 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -65,6 +65,8 @@ newnfa(struct vars *v,
     nfa->v = v;
     nfa->bos[0] = nfa->bos[1] = COLORLESS;
     nfa->eos[0] = nfa->eos[1] = COLORLESS;
+    nfa->flags = 0;
+    nfa->minmatchall = nfa->maxmatchall = -1;
     nfa->parent = parent;        /* Precedes newfstate so parent is valid. */
     nfa->post = newfstate(nfa, '@');    /* number 0 */
     nfa->pre = newfstate(nfa, '>'); /* number 1 */
@@ -2875,8 +2877,14 @@ analyze(struct nfa *nfa)
     if (NISERR())
         return 0;

+    /* Detect whether NFA can't match anything */
     if (nfa->pre->outs == NULL)
         return REG_UIMPOSSIBLE;
+
+    /* Detect whether NFA matches all strings (possibly with length bounds) */
+    checkmatchall(nfa);
+
+    /* Detect whether NFA can possibly match a zero-length string */
     for (a = nfa->pre->outs; a != NULL; a = a->outchain)
         for (aa = a->to->outs; aa != NULL; aa = aa->outchain)
             if (aa->to == nfa->post)
@@ -2884,6 +2892,186 @@ analyze(struct nfa *nfa)
     return 0;
 }

+/*
+ * checkmatchall - does the NFA represent no more than a string length test?
+ *
+ * If so, set nfa->minmatchall and nfa->maxmatchall correctly (they are -1
+ * to begin with) and set the MATCHALL bit in nfa->flags.
+ *
+ * To succeed, we require all arcs to be PLAIN RAINBOW arcs, except that
+ * we can ignore PLAIN arcs for pseudocolors, knowing that such arcs will
+ * appear only at appropriate places in the graph.  We must be able to reach
+ * the post state via RAINBOW arcs, and if there are any loops in the graph,
+ * they must be loop-to-self arcs, ensuring that each loop iteration consumes
+ * exactly one character.  (Longer loops are problematic because they create
+ * non-consecutive possible match lengths; we have no good way to represent
+ * that situation for lengths beyond the DUPINF limit.)
+ */
+static void
+checkmatchall(struct nfa *nfa)
+{
+    bool        hasmatch[DUPINF + 1];
+    int            minmatch,
+                maxmatch,
+                morematch;
+
+    /*
+     * hasmatch[i] will be set true if a match of length i is feasible, for i
+     * from 0 to DUPINF-1.  hasmatch[DUPINF] will be set true if every match
+     * length of DUPINF or more is feasible.
+     */
+    memset(hasmatch, 0, sizeof(hasmatch));
+
+    /*
+     * Recursively search the graph for all-RAINBOW paths to the "post" state,
+     * starting at the "pre" state.  The -1 initial depth accounts for the
+     * fact that transitions out of the "pre" state are not part of the
+     * matched string.  We likewise don't count the final transition to the
+     * "post" state as part of the match length.  (But we still insist that
+     * those transitions have RAINBOW arcs, otherwise there are lookbehind or
+     * lookahead constraints at the start/end of the pattern.)
+     */
+    if (!checkmatchall_recurse(nfa, nfa->pre, false, -1, hasmatch))
+        return;
+
+    /*
+     * We found some all-RAINBOW paths, and not anything that we couldn't
+     * handle.  hasmatch[] now represents the set of possible match lengths;
+     * but we want to reduce that to a min and max value, because it doesn't
+     * seem worth complicating regexec.c to deal with nonconsecutive possible
+     * match lengths.  Find min and max of first run of lengths, then verify
+     * there are no nonconsecutive lengths.
+     */
+    for (minmatch = 0; minmatch <= DUPINF; minmatch++)
+    {
+        if (hasmatch[minmatch])
+            break;
+    }
+    assert(minmatch <= DUPINF); /* else checkmatchall_recurse lied */
+    for (maxmatch = minmatch; maxmatch < DUPINF; maxmatch++)
+    {
+        if (!hasmatch[maxmatch + 1])
+            break;
+    }
+    for (morematch = maxmatch + 1; morematch <= DUPINF; morematch++)
+    {
+        if (hasmatch[morematch])
+            return;                /* fail, there are nonconsecutive lengths */
+    }
+
+    /* Success, so record the info */
+    nfa->minmatchall = minmatch;
+    nfa->maxmatchall = maxmatch;
+    nfa->flags |= MATCHALL;
+}
+
+/*
+ * checkmatchall_recurse - recursive search for checkmatchall
+ *
+ * s is the current state
+ * foundloop is true if any predecessor state has a loop-to-self
+ * depth is the current recursion depth (starting at -1)
+ * hasmatch[] is the output area for recording feasible match lengths
+ *
+ * We return true if there is at least one all-RAINBOW path to the "post"
+ * state and no non-matchall paths; otherwise false.  Note we assume that
+ * any dead-end paths have already been removed, else we might return
+ * false unnecessarily.
+ */
+static bool
+checkmatchall_recurse(struct nfa *nfa, struct state *s,
+                      bool foundloop, int depth,
+                      bool *hasmatch)
+{
+    bool        result = false;
+    struct arc *a;
+
+    /*
+     * Since this is recursive, it could be driven to stack overflow.  But we
+     * need not treat that as a hard failure; just deem the NFA non-matchall.
+     */
+    if (STACK_TOO_DEEP(nfa->v->re))
+        return false;
+
+    /*
+     * Likewise, if we get to a depth too large to represent correctly in
+     * maxmatchall, fail quietly.
+     */
+    if (depth >= DUPINF)
+        return false;
+
+    /*
+     * Scan the outarcs to detect cases we can't handle, and to see if there
+     * is a loop-to-self here.  We need to know about any such loop before we
+     * recurse, so it's hard to avoid making two passes over the outarcs.  In
+     * any case, checking for showstoppers before we recurse is probably best.
+     */
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->type != PLAIN)
+            return false;        /* any LACONs make it non-matchall */
+        if (a->co != RAINBOW)
+        {
+            if (nfa->cm->cd[a->co].flags & PSEUDO)
+                continue;        /* ignore pseudocolor transitions */
+            return false;        /* any other color makes it non-matchall */
+        }
+        if (a->to == s)
+        {
+            /*
+             * We found a cycle of length 1, so remember that to pass down to
+             * successor states.  (It doesn't matter if there was also such a
+             * loop at a predecessor state.)
+             */
+            foundloop = true;
+        }
+        else if (a->to->tmp)
+        {
+            /* We found a cycle of length > 1, so fail. */
+            return false;
+        }
+    }
+
+    /* We need to recurse, so mark state as under consideration */
+    assert(s->tmp == NULL);
+    s->tmp = s;
+
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->co != RAINBOW)
+            continue;            /* ignore pseudocolor transitions */
+        if (a->to == nfa->post)
+        {
+            /* We found an all-RAINBOW path to the post state */
+            result = true;
+            /* Record potential match lengths */
+            assert(depth >= 0);
+            hasmatch[depth] = true;
+            if (foundloop)
+            {
+                /* A predecessor loop makes all larger lengths match, too */
+                int            i;
+
+                for (i = depth + 1; i <= DUPINF; i++)
+                    hasmatch[i] = true;
+            }
+        }
+        else if (a->to != s)
+        {
+            /* This is a new path forward; recurse to investigate */
+            result = checkmatchall_recurse(nfa, a->to,
+                                           foundloop, depth + 1,
+                                           hasmatch);
+            /* Fail if any recursive path fails */
+            if (!result)
+                break;
+        }
+    }
+
+    s->tmp = NULL;
+    return result;
+}
+
 /*
  * compact - construct the compact representation of an NFA
  */
@@ -2930,7 +3118,9 @@ compact(struct nfa *nfa,
     cnfa->eos[0] = nfa->eos[0];
     cnfa->eos[1] = nfa->eos[1];
     cnfa->ncolors = maxcolor(nfa->cm) + 1;
-    cnfa->flags = 0;
+    cnfa->flags = nfa->flags;
+    cnfa->minmatchall = nfa->minmatchall;
+    cnfa->maxmatchall = nfa->maxmatchall;

     ca = cnfa->arcs;
     for (s = nfa->states; s != NULL; s = s->next)
@@ -3034,6 +3224,11 @@ dumpnfa(struct nfa *nfa,
         fprintf(f, ", eos [%ld]", (long) nfa->eos[0]);
     if (nfa->eos[1] != COLORLESS)
         fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
+    if (nfa->flags & HASLACONS)
+        fprintf(f, ", haslacons");
+    if (nfa->flags & MATCHALL)
+        fprintf(f, ", minmatchall %d, maxmatchall %d",
+                nfa->minmatchall, nfa->maxmatchall);
     fprintf(f, "\n");
     for (s = nfa->states; s != NULL; s = s->next)
     {
@@ -3201,6 +3396,9 @@ dumpcnfa(struct cnfa *cnfa,
         fprintf(f, ", eol [%ld]", (long) cnfa->eos[1]);
     if (cnfa->flags & HASLACONS)
         fprintf(f, ", haslacons");
+    if (cnfa->flags & MATCHALL)
+        fprintf(f, ", minmatchall %d, maxmatchall %d",
+                cnfa->minmatchall, cnfa->maxmatchall);
     fprintf(f, "\n");
     for (st = 0; st < cnfa->nstates; st++)
         dumpcstate(st, cnfa, f);
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 5956b86026..6ca5f5cf4c 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -175,6 +175,9 @@ static void cleanup(struct nfa *);
 static void markreachable(struct nfa *, struct state *, struct state *, struct state *);
 static void markcanreach(struct nfa *, struct state *, struct state *, struct state *);
 static long analyze(struct nfa *);
+static void checkmatchall(struct nfa *);
+static bool checkmatchall_recurse(struct nfa *, struct state *,
+                                  bool, int, bool *);
 static void compact(struct nfa *, struct cnfa *);
 static void carcsort(struct carc *, size_t);
 static int    carc_cmp(const void *, const void *);
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 32be2592c5..20ec463204 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -58,6 +58,29 @@ longest(struct vars *v,
     if (hitstopp != NULL)
         *hitstopp = 0;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = stop - start;
+        size_t        maxmatchall = d->cnfa->maxmatchall;
+
+        if (nchr < d->cnfa->minmatchall)
+            return NULL;
+        if (maxmatchall == DUPINF)
+        {
+            if (stop == v->stop && hitstopp != NULL)
+                *hitstopp = 1;
+        }
+        else
+        {
+            if (stop == v->stop && nchr <= maxmatchall + 1 && hitstopp != NULL)
+                *hitstopp = 1;
+            if (nchr > maxmatchall)
+                return start + maxmatchall;
+        }
+        return stop;
+    }
+
     /* initialize */
     css = initialize(v, d, start);
     if (css == NULL)
@@ -187,6 +210,24 @@ shortest(struct vars *v,
     if (hitstopp != NULL)
         *hitstopp = 0;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = min - start;
+
+        if (d->cnfa->maxmatchall != DUPINF &&
+            nchr > d->cnfa->maxmatchall)
+            return NULL;
+        if ((max - start) < d->cnfa->minmatchall)
+            return NULL;
+        if (nchr < d->cnfa->minmatchall)
+            min = start + d->cnfa->minmatchall;
+        if (coldp != NULL)
+            *coldp = start;
+        /* there is no case where we should set *hitstopp */
+        return min;
+    }
+
     /* initialize */
     css = initialize(v, d, start);
     if (css == NULL)
@@ -312,6 +353,22 @@ matchuntil(struct vars *v,
     struct sset *ss;
     struct colormap *cm = d->cm;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = probe - v->start;
+
+        /*
+         * It might seem that we should check maxmatchall too, but the
+         * implicit .* at the front of the pattern absorbs any extra
+         * characters.  Thus, we should always match as long as there are at
+         * least minmatchall characters.
+         */
+        if (nchr < d->cnfa->minmatchall)
+            return 0;
+        return 1;
+    }
+
     /* initialize and startup, or restart, if necessary */
     if (cp == NULL || cp > probe)
     {
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index e2fbad7a8a..ec435b6f5f 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -77,6 +77,10 @@ pg_regprefix(regex_t *re,
     assert(g->tree != NULL);
     cnfa = &g->tree->cnfa;

+    /* matchall NFAs never have a fixed prefix */
+    if (cnfa->flags & MATCHALL)
+        return REG_NOMATCH;
+
     /*
      * Since a correct NFA should never contain any exit-free loops, it should
      * not be possible for our traversal to return to a previously visited NFA
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 5bcd669d59..956b37b72d 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -331,6 +331,9 @@ struct nfa
     struct colormap *cm;        /* the color map */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
     color        eos[2];            /* colors, if any, assigned to EOS and EOL */
+    int            flags;            /* flags to pass forward to cNFA */
+    int            minmatchall;    /* min number of chrs to match, if matchall */
+    int            maxmatchall;    /* max number of chrs to match, or DUPINF */
     struct vars *v;                /* simplifies compile error reporting */
     struct nfa *parent;            /* parent NFA, if any */
 };
@@ -353,6 +356,14 @@ struct nfa
  *
  * Note that in a plain arc, "co" can be RAINBOW; since that's negative,
  * it doesn't break the rule about how to recognize LACON arcs.
+ *
+ * We have special markings for "trivial" NFAs that can match any string
+ * (possibly with limits on the number of characters therein).  In such a
+ * case, flags & MATCHALL is set (and HASLACONS can't be set).  Then the
+ * fields minmatchall and maxmatchall give the minimum and maximum numbers
+ * of characters to match.  For example, ".*" produces minmatchall = 0
+ * and maxmatchall = DUPINF, while ".+" produces minmatchall = 1 and
+ * maxmatchall = DUPINF.
  */
 struct carc
 {
@@ -366,6 +377,7 @@ struct cnfa
     int            ncolors;        /* number of colors (max color in use + 1) */
     int            flags;
 #define  HASLACONS    01            /* uses lookaround constraints */
+#define  MATCHALL    02            /* matches all strings of a range of lengths */
     int            pre;            /* setup state number */
     int            post;            /* teardown state number */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
@@ -375,6 +387,9 @@ struct cnfa
     struct carc **states;        /* vector of pointers to outarc lists */
     /* states[n] are pointers into a single malloc'd array of arcs */
     struct carc *arcs;            /* the area for the lists */
+    /* these fields are used only in a MATCHALL NFA (else they're -1): */
+    int            minmatchall;    /* min number of chrs to match */
+    int            maxmatchall;    /* max number of chrs to match, or DUPINF */
 };

 #define ZAPCNFA(cnfa)    ((cnfa).nstates = 0)
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 0dc2265d8b..90dec92019 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3376,6 +3376,31 @@ select * from test_regex('(?<=b)b', 'b', 'HP');
  {0,REG_ULOOKAROUND,REG_UNONPOSIX}
 (1 row)

+-- expectMatch    23.19 HP        (?<=..)a*    aaabb
+select * from test_regex('(?<=..)a*', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {a}
+(2 rows)
+
+-- expectMatch    23.20 HP        (?<=..)b*    aaabb
+-- Note: empty match here is correct, it matches after the first 2 characters
+select * from test_regex('(?<=..)b*', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {""}
+(2 rows)
+
+-- expectMatch    23.21 HP        (?<=..)b+    aaabb
+select * from test_regex('(?<=..)b+', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {bb}
+(2 rows)
+
 -- doing 24 "non-greedy quantifiers"
 -- expectMatch    24.1  PT    ab+?        abb    ab
 select * from test_regex('ab+?', 'abb', 'PT');
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 1a2bfa6235..506924e904 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1068,6 +1068,13 @@ select * from test_regex('a(?<!b)b*', 'a', 'HP');
 select * from test_regex('(?<=b)b', 'bb', 'HP');
 -- expectNomatch    23.18 HP        (?<=b)b        b
 select * from test_regex('(?<=b)b', 'b', 'HP');
+-- expectMatch    23.19 HP        (?<=..)a*    aaabb
+select * from test_regex('(?<=..)a*', 'aaabb', 'HP');
+-- expectMatch    23.20 HP        (?<=..)b*    aaabb
+-- Note: empty match here is correct, it matches after the first 2 characters
+select * from test_regex('(?<=..)b*', 'aaabb', 'HP');
+-- expectMatch    23.21 HP        (?<=..)b+    aaabb
+select * from test_regex('(?<=..)b+', 'aaabb', 'HP');

 -- doing 24 "non-greedy quantifiers"
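
As a reading aid for the 0002 patch above: the final step of checkmatchall() collapses the set of feasible match lengths into one [min, max] range, and gives up if the lengths are not consecutive. A minimal standalone sketch of that reduction follows. The function name reduce_match_lengths is hypothetical, and DUPINF is defined locally with an assumed value of 256 (the real definition lives in regguts.h); none of this is part of the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

#define DUPINF 256				/* assumed value; really defined in regguts.h */

/*
 * Reduce a set of feasible match lengths to a single [min, max] range.
 *
 * hasmatch[i] is true if a match of length i is feasible, for i from 0 to
 * DUPINF-1; hasmatch[DUPINF] stands for "every length of DUPINF or more".
 * Returns false if no length is feasible or if the feasible lengths do not
 * form one consecutive run.  (Hypothetical helper, not the patch's code.)
 */
static bool
reduce_match_lengths(const bool *hasmatch, int *minmatch, int *maxmatch)
{
	int			lo,
				hi,
				more;

	assert(hasmatch && minmatch && maxmatch);

	/* find the smallest feasible length */
	for (lo = 0; lo <= DUPINF; lo++)
	{
		if (hasmatch[lo])
			break;
	}
	if (lo > DUPINF)
		return false;			/* nothing is feasible */

	/* extend the run as far as it stays consecutive */
	for (hi = lo; hi < DUPINF; hi++)
	{
		if (!hasmatch[hi + 1])
			break;
	}

	/* any feasible length beyond the run means a gap, so fail */
	for (more = hi + 1; more <= DUPINF; more++)
	{
		if (hasmatch[more])
			return false;
	}

	*minmatch = lo;
	*maxmatch = hi;
	return true;
}
```

For a set like {2, 3, 4} this reports min 2 and max 4; adding 6 to the set makes the lengths nonconsecutive, so the reduction fails and the NFA would not be marked MATCHALL.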

Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

13 February 2021, 17:19:34

Hi Tom,

On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote:
>0001-invent-rainbow-arcs.patch
>0002-recognize-matchall-NFAs.patch

Many thanks for working on the regex engine,
this looks like an awesome optimization.

To test the correctness of the patches,
I thought it would be nice to have some real-life regexes,
and just as important, some real-life text strings
to which the real-life regexes are applied.

I therefore patched Chromium's V8 regex engine
to log the actual regexes that get compiled when
visiting websites, and also the text strings that
the regexes are applied to at run time
when the regexes are executed.

I logged the regex and text strings as base64-encoded
strings to STDOUT, to make it easy to grep out the data,
so it could be imported into PostgreSQL for analytics.

In total, I scraped the first-page of some ~50k websites,
which produced 45M test rows to import,
which when GROUP BY pattern and flags was reduced
down to 235k different regex patterns,
and 1.5M different text string subjects.

Here are some statistics on the different flags used:

SELECT *, SUM(COUNT) OVER () FROM (SELECT flags, COUNT(*) FROM patterns GROUP BY flags) AS x ORDER BY COUNT DESC;
 flags | count  |  sum
-------+--------+--------
       | 150097 | 235204
 i     |  43537 | 235204
 g     |  22029 | 235204
 gi    |  15416 | 235204
 gm    |   2411 | 235204
 gim   |    602 | 235204
 m     |    548 | 235204
 im    |    230 | 235204
 y     |    193 | 235204
 gy    |     60 | 235204
 giy   |     29 | 235204
 giu   |     26 | 235204
 u     |     11 | 235204
 iy    |      6 | 235204
 gu    |      5 | 235204
 gimu  |      2 | 235204
 iu    |      1 | 235204
 my    |      1 | 235204
(18 rows)

As we can see, no flag at all is the most common, followed by the "i" flag.

Most of the JavaScript regexes (97%) could be understood by PostgreSQL;
only 3% produced some kind of error, which is not unexpected,
since some JavaScript regex features like \w and \W have different
syntax in PostgreSQL:

SELECT *, SUM(COUNT) OVER () FROM (SELECT is_match,error,COUNT(*) FROM subjects GROUP BY is_match,error) AS x ORDER BY count DESC;
 is_match |                             error                             | count  |   sum
----------+---------------------------------------------------------------+--------+---------
 f        |                                                               | 973987 | 1489489
 t        |                                                               | 474225 | 1489489
          | invalid regular expression: invalid escape \ sequence         |  39141 | 1489489
          | invalid regular expression: invalid character range           |    898 | 1489489
          | invalid regular expression: invalid backreference number      |    816 | 1489489
          | invalid regular expression: brackets [] not balanced          |    327 | 1489489
          | invalid regular expression: invalid repetition count(s)       |     76 | 1489489
          | invalid regular expression: quantifier operand invalid        |     17 | 1489489
          | invalid regular expression: parentheses () not balanced       |      1 | 1489489
          | invalid regular expression: regular expression is too complex |      1 | 1489489
(10 rows)

Having had some fun looking at statistics, let's look at whether there are any
observable differences between HEAD (8063d0f6f56e53edd991f53aadc8cb7f8d3fdd8f)
and HEAD with these two patches applied.

To detect any differences,
for each (regex pattern, text string subject) pair,
the columns

    is_match boolean
    captured text[]
    error text

were set by a PL/pgSQL function running on HEAD:

    BEGIN
        _is_match := _subject ~ _pattern;
        _captured := regexp_match(_subject, _pattern);
    EXCEPTION WHEN OTHERS THEN
        UPDATE subjects SET
            error = SQLERRM
        WHERE subject_id = _subject_id;
        CONTINUE;
    END;
    UPDATE subjects SET
        is_match = _is_match,
        captured = _captured
    WHERE subject_id = _subject_id;

The patches

    0001-invent-rainbow-arcs.patch
    0002-recognize-matchall-NFAs.patch

were then applied and this query was executed to spot any differences:

SELECT
    is_match <> (subject ~ pattern) AS is_match_diff,
    captured IS DISTINCT FROM regexp_match(subject, pattern) AS captured_diff,
    COUNT(*)
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern) OR captured IS DISTINCT FROM regexp_match(subject, pattern))
GROUP BY 1,2
ORDER BY 3 DESC
;

The query was first run on the unpatched HEAD to verify that it returns no rows.
0 rows indeed, and it took this long to finish the query:

Time: 186077.866 ms (03:06.078)

Running the same query with the two patches applied was significantly faster:

Time: 111785.735 ms (01:51.786)

No is_match differences were detected, good!

However, there were 23 cases where what got captured differed:

-[ RECORD 1 ]--+----------------------------------------
pattern        | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject        | v-cloak
is_match_head  | t
captured_head  | {cloak,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-cloak}
-[ RECORD 2 ]--+----------------------------------------
pattern        | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject        | v-if
is_match_head  | t
captured_head  | {if,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-if}
-[ RECORD 3 ]--+----------------------------------------
pattern        | (^.*://)?((www.)?a5oc.com).*
subject        | https://a5oc.com/attachments/6b184e79-6a7f-43e0-ac59-7ed9d0a8eb7e-jpeg.179582/
is_match_head  | t
captured_head  | {https://,a5oc.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 4 ]--+----------------------------------------
pattern        | (^.*://)?((www.)?allfordmustangs.com).*
subject        | https://allfordmustangs.com/attachments/e463e329-0397-4e13-ad41-f30c6bc0659e-jpeg.779299/
is_match_head  | t
captured_head  | {https://,allfordmustangs.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 5 ]--+----------------------------------------
pattern        | (^.*://)?((www.)?audi-forums.com).*
subject        | https://audi-forums.com/attachments/screenshot_20210207-151100_ebay-jpg.11506/
is_match_head  | t
captured_head  | {https://,audi-forums.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 6 ]--+----------------------------------------
pattern        | (^.*://)?((www.)?can-amforum.com).*
subject        | https://can-amforum.com/attachments/resized_20201214_163325-jpeg.101395/
is_match_head  | t
captured_head  | {https://,can-amforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 7 ]--+----------------------------------------
pattern        | (^.*://)?((www.)?contractortalk.com).*
subject        | https://contractortalk.com/attachments/maryann-porch-roof-quote-12feb2021-jpg.508976/
is_match_head  | t
captured_head  | {https://,contractortalk.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 8 ]--+----------------------------------------
pattern        | (^.*://)?((www.)?halloweenforum.com).*
subject        | https://halloweenforum.com/attachments/dead-fred-head-before-and-after-jpg.744080/
is_match_head  | t
captured_head  | {https://,halloweenforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 9 ]--+----------------------------------------
pattern        | (^.*://)?((www.)?horseforum.com).*
subject        | https://horseforum.com/attachments/dd90f089-9ae9-4521-98cd-27bda9ad38e9-jpeg.1109329/
is_match_head  | t
captured_head  | {https://,horseforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 10 ]-+----------------------------------------
pattern        | (^.*://)?((www.)?passatworld.com).*
subject        | https://passatworld.com/attachments/clean-passat-jpg.102337/
is_match_head  | t
captured_head  | {https://,passatworld.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 11 ]-+----------------------------------------
pattern        | (^.*://)?((www.)?plantedtank.net).*
subject        | https://plantedtank.net/attachments/brendon-60p-jpg.1026075/
is_match_head  | t
captured_head  | {https://,plantedtank.net,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 12 ]-+----------------------------------------
pattern        | (^.*://)?((www.)?vauxhallownersnetwork.co.uk).*
subject        | https://vauxhallownersnetwork.co.uk/attachments/opelnavi-jpg.96639/
is_match_head  | t
captured_head  | {https://,vauxhallownersnetwork.co.uk,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 13 ]-+----------------------------------------
pattern        | (^.*://)?((www.)?volvov40club.com).*
subject        | https://volvov40club.com/attachments/img_20210204_164157-jpg.17356/
is_match_head  | t
captured_head  | {https://,volvov40club.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 14 ]-+----------------------------------------
pattern        | (^.*://)?((www.)?vwidtalk.com).*
subject        | https://vwidtalk.com/attachments/1613139846689-png.1469/
is_match_head  | t
captured_head  | {https://,vwidtalk.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 15 ]-+----------------------------------------
pattern        | (^.*://)?((www.)?yellowbullet.com).*
subject        | https://yellowbullet.com/attachments/20210211_133934-jpg.204604/
is_match_head  | t
captured_head  | {https://,yellowbullet.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 16 ]-+----------------------------------------
pattern        | (^[^\?]*)?(\?[^#]*)?(#.*$)?
subject        | https://www.disneyonice.com/oneIdResponder.html
is_match_head  | t
captured_head  | {https://www.disneyonice.com/oneIdResponder.html,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 17 ]-+----------------------------------------
pattern        | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject        | /
is_match_head  | t
captured_head  | {/,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 18 ]-+----------------------------------------
pattern        | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject        | /en.html
is_match_head  | t
captured_head  | {/en,.html}
is_match_patch | t
captured_patch |
-[ RECORD 19 ]-+----------------------------------------
pattern        | (^https?:\/\/)?(((\[[^\]]+\])|([^:\/\?#]+))(:(\d+))?)?([^\?#]*)(.*)?
subject        | https://e.echatsoft.com/mychat/visitor
is_match_head  | t
captured_head  | {https://,e.echatsoft.com,e.echatsoft.com,NULL,e.echatsoft.com,NULL,NULL,/mychat/visitor,""}
is_match_patch | t
captured_patch | {NULL,https,https,NULL,https,NULL,NULL,://e.echatsoft.com/mychat/visitor,""}
-[ RECORD 20 ]-+----------------------------------------
pattern        | (^|.)41nbc.com$|(^|.)41nbc.dev$|(^|.)52.23.179.12$|(^|.)52.3.245.221$|(^|.)clipsyndicate.com$|(^|.)michaelbgiordano.com$|(^|.)syndicaster.tv$|(^|.)wdef.com$|(^|.)wdef.dev$|(^|.)wxxv.mysiteserver.net$|(^|.)wxxv25.dev$|(^|.)clipsyndicate.com$|(^|.)syndicaster.tv$
subject        | wdef.com
is_match_head  | t
captured_head  | {NULL,NULL,NULL,NULL,NULL,NULL,NULL,"",NULL,NULL,NULL,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 21 ]-+----------------------------------------
pattern        | ^((^\w+:|^)\/\/)?(?:www\.)?
subject        | https://www.deputy.com/
is_match_head  | t
captured_head  | {https://,https:}
is_match_patch | t
captured_patch |
-[ RECORD 22 ]-+----------------------------------------
pattern        | ^((^\w+:|^)\/\/)?(?:www\.)?
subject        | https://www.westernsydney.edu.au/
is_match_head  | t
captured_head  | {https://,https:}
is_match_patch | t
captured_patch |
-[ RECORD 23 ]-+----------------------------------------
pattern        | ^(https?:){0,1}\/\/|
subject        | https://ui.powerreviews.com/api/
is_match_head  | t
captured_head  | {https:}
is_match_patch | t
captured_patch | {NULL}

The code to reproduce the results has been pushed here:

https://github.com/truthly/regexes-in-the-wild

Let me know if you want access to the dataset,
I could open up a port to my PostgreSQL so you could take a dump.

SELECT
    pg_size_pretty(pg_relation_size('patterns')) AS patterns,
    pg_size_pretty(pg_relation_size('subjects')) AS subjects;
 patterns | subjects
----------+----------
 20 MB    | 568 MB
(1 row)

/Joel

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

13 February 2021, 17:35:45

"Joel Jacobson" <joel@compiler.org> writes:
> In total, I scraped the first-page of some ~50k websites,
> which produced 45M test rows to import,
> which when GROUP BY pattern and flags was reduced
> down to 235k different regex patterns,
> and 1.5M different text string subjects.

This seems like an incredibly useful test dataset.
I'd definitely like a copy.

> No is_match differences were detected, good!

Cool ...

> However, there were 23 cases where what got captured differed:

I shall take a closer look at that.

Many thanks for doing this work!

            regards, tom lane

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

13 February 2021, 21:11:37

"Joel Jacobson" <joel@compiler.org> writes:
> No is_match differences were detected, good!
> However, there were 23 cases where what got captured differed:

These all stem from the same oversight: checkmatchall() was being
too cavalier by ignoring "pseudocolor" arcs, which are arcs that
match start-of-string or end-of-string markers.  I'd supposed that
pseudocolor arcs necessarily match parallel RAINBOW arcs, because
they start out that way (cf. newnfa).  But it turns out that
some edge-of-string constraints can be optimized in such a way that
they only appear in the final NFA in the guise of missing or extra
pseudocolor arcs.  We have to actually check that the pseudocolor arcs
match the RAINBOW arcs, otherwise our "matchall" NFA isn't one because
it acts differently at the start or end of the string than it does
elsewhere.
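
To make the symmetric check described above concrete, here is a minimal
toy sketch in C.  It is not code from the patch: the struct toy_arc type,
the TOY_* color constants, and both function names are hypothetical
stand-ins chosen for illustration, and a state's outarcs are modeled as a
flat array rather than the engine's linked chains.  The idea is only that
every RAINBOW outarc of the pre state needs a parallel BOS and BOL outarc
to the same target state, and vice versa; a missing or extra pseudocolor
arc means the NFA behaves differently at the string edge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real engine uses struct arc and color. */
#define TOY_RAINBOW (-2)        /* matches every non-pseudo color */
#define TOY_BOS     1           /* assumed pseudocolor: beginning of string */
#define TOY_BOL     2           /* assumed pseudocolor: beginning of line */

struct toy_arc
{
    int         co;             /* color the arc matches */
    int         to;             /* number of the target state */
};

/*
 * For every arc of color co1, require a parallel arc of color co2
 * leading to the same target state.
 */
static bool
toy_colors_match(const struct toy_arc *arcs, size_t n, int co1, int co2)
{
    assert(arcs != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        bool        found = false;

        if (arcs[i].co != co1)
            continue;
        for (size_t j = 0; j < n; j++)
        {
            if (arcs[j].co == co2 && arcs[j].to == arcs[i].to)
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;       /* missing parallel arc */
    }
    return true;
}

/*
 * The pre state qualifies only if the RAINBOW, BOS and BOL outarcs
 * replicate each other exactly; the test runs in both directions so
 * that both missing and extra pseudocolor arcs are caught.
 */
static bool
toy_pre_state_is_matchall(const struct toy_arc *arcs, size_t n)
{
    return toy_colors_match(arcs, n, TOY_RAINBOW, TOY_BOS) &&
        toy_colors_match(arcs, n, TOY_BOS, TOY_RAINBOW) &&
        toy_colors_match(arcs, n, TOY_RAINBOW, TOY_BOL) &&
        toy_colors_match(arcs, n, TOY_BOL, TOY_RAINBOW);
}
```

In this toy model, a pre state with outarcs RAINBOW, BOS and BOL all
leading to state 2 passes, while dropping the BOL arc or adding a BOS arc
to some other state makes the check fail.  The same shape of test would
apply to the post state's inarcs with the EOS/EOL pseudocolors.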

So here's a revised pair of patches (0001 is actually the same as
before).

Thanks again for testing!

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 1e4f0121f3..fcf03de32d 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -282,8 +282,8 @@ typedef struct
 typedef int TrgmColor;

 /* We assume that colors returned by the regexp engine cannot be these: */
-#define COLOR_UNKNOWN    (-1)
-#define COLOR_BLANK        (-2)
+#define COLOR_UNKNOWN    (-3)
+#define COLOR_BLANK        (-4)

 typedef struct
 {
@@ -780,7 +780,8 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
         palloc0(colorsCount * sizeof(TrgmColorInfo));

     /*
-     * Loop over colors, filling TrgmColorInfo about each.
+     * Loop over colors, filling TrgmColorInfo about each.  Note we include
+     * WHITE (0) even though we know it'll be reported as non-expandable.
      */
     for (i = 0; i < colorsCount; i++)
     {
@@ -1098,9 +1099,9 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
             /* Add enter key to this state */
             addKeyToQueue(trgmNFA, &destKey);
         }
-        else
+        else if (arc->co >= 0)
         {
-            /* Regular color */
+            /* Regular color (including WHITE) */
             TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[arc->co];

             if (colorInfo->expandable)
@@ -1156,6 +1157,14 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
                 addKeyToQueue(trgmNFA, &destKey);
             }
         }
+        else
+        {
+            /* RAINBOW: treat as unexpandable color */
+            destKey.prefix.colors[0] = COLOR_UNKNOWN;
+            destKey.prefix.colors[1] = COLOR_UNKNOWN;
+            destKey.nstate = arc->to;
+            addKeyToQueue(trgmNFA, &destKey);
+        }
     }

     pfree(arcs);
@@ -1216,10 +1225,10 @@ addArcs(TrgmNFA *trgmNFA, TrgmState *state)
             /*
              * Ignore non-expandable colors; addKey already handled the case.
              *
-             * We need no special check for begin/end pseudocolors here.  We
-             * don't need to do any processing for them, and they will be
-             * marked non-expandable since the regex engine will have reported
-             * them that way.
+             * We need no special check for WHITE or begin/end pseudocolors
+             * here.  We don't need to do any processing for them, and they
+             * will be marked non-expandable since the regex engine will have
+             * reported them that way.
              */
             if (!colorInfo->expandable)
                 continue;
diff --git a/src/backend/regex/README b/src/backend/regex/README
index f08aab69e3..cc1834b89c 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -261,6 +261,18 @@ and the NFA has these arcs:
     states 4 -> 5 on color 2 ("x" only)
 which can be seen to be a correct representation of the regex.

+There is one more complexity, which is how to handle ".", that is a
+match-anything atom.  We used to do that by generating a "rainbow"
+of arcs of all live colors between the two NFA states before and after
+the dot.  That's expensive in itself when there are lots of colors,
+and it also typically adds lots of follow-on arc-splitting work for the
+color splitting logic.  Now we handle this case by generating a single arc
+labeled with the special color RAINBOW, meaning all colors.  Such arcs
+never need to be split, so they help keep NFAs small in this common case.
+(Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
+not supposed to match newline.  In that case we still handle "." by
+generating an almost-rainbow of all colors except newline's color.)
+
 Given this summary, we can see we need the following operations for
 colors:

@@ -349,6 +361,8 @@ The possible arc types are:

     PLAIN arcs, which specify matching of any character of a given "color"
     (see above).  These are dumped as "[color_number]->to_state".
+    In addition there can be "rainbow" PLAIN arcs, which are dumped as
+    "[*]->to_state".

     EMPTY arcs, which specify a no-op transition to another state.  These
     are dumped as "->to_state".
@@ -356,11 +370,11 @@ The possible arc types are:
     AHEAD constraints, which represent a "next character must be of this
     color" constraint.  AHEAD differs from a PLAIN arc in that the input
     character is not consumed when crossing the arc.  These are dumped as
-    ">color_number>->to_state".
+    ">color_number>->to_state", or possibly ">*>->to_state".

     BEHIND constraints, which represent a "previous character must be of
     this color" constraint, which likewise consumes no input.  These are
-    dumped as "<color_number<->to_state".
+    dumped as "<color_number<->to_state", or possibly "<*<->to_state".

     '^' arcs, which specify a beginning-of-input constraint.  These are
     dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index f5a4151757..0864011cce 100644
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -977,6 +977,7 @@ colorchain(struct colormap *cm,
 {
     struct colordesc *cd = &cm->cd[a->co];

+    assert(a->co >= 0);
     if (cd->arcs != NULL)
         cd->arcs->colorchainRev = a;
     a->colorchain = cd->arcs;
@@ -994,6 +995,7 @@ uncolorchain(struct colormap *cm,
     struct colordesc *cd = &cm->cd[a->co];
     struct arc *aa = a->colorchainRev;

+    assert(a->co >= 0);
     if (aa == NULL)
     {
         assert(cd->arcs == a);
@@ -1012,6 +1014,9 @@ uncolorchain(struct colormap *cm,

 /*
  * rainbow - add arcs of all full colors (but one) between specified states
+ *
+ * If there isn't an exception color, we now generate just a single arc
+ * labeled RAINBOW, saving lots of arc-munging later on.
  */
 static void
 rainbow(struct nfa *nfa,
@@ -1025,6 +1030,13 @@ rainbow(struct nfa *nfa,
     struct colordesc *end = CDEND(cm);
     color        co;

+    if (but == COLORLESS)
+    {
+        newarc(nfa, type, RAINBOW, from, to);
+        return;
+    }
+
+    /* Gotta do it the hard way.  Skip subcolors, pseudocolors, and "but" */
     for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
         if (!UNUSEDCOLOR(cd) && cd->sub != co && co != but &&
             !(cd->flags & PSEUDO))
@@ -1034,13 +1046,16 @@ rainbow(struct nfa *nfa,
 /*
  * colorcomplement - add arcs of complementary colors
  *
+ * We add arcs of all colors that are not pseudocolors and do not match
+ * any of the "of" state's PLAIN outarcs.
+ *
  * The calling sequence ought to be reconciled with cloneouts().
  */
 static void
 colorcomplement(struct nfa *nfa,
                 struct colormap *cm,
                 int type,
-                struct state *of,    /* complements of this guy's PLAIN outarcs */
+                struct state *of,
                 struct state *from,
                 struct state *to)
 {
@@ -1049,6 +1064,11 @@ colorcomplement(struct nfa *nfa,
     color        co;

     assert(of != from);
+
+    /* A RAINBOW arc matches all colors, making the complement empty */
+    if (findarc(of, PLAIN, RAINBOW) != NULL)
+        return;
+
     for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
         if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
             if (findarc(of, PLAIN, co) == NULL)
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index 92c9c4d795..1ac030570d 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -271,6 +271,11 @@ destroystate(struct nfa *nfa,
  *
  * This function checks to make sure that no duplicate arcs are created.
  * In general we never want duplicates.
+ *
+ * However: in principle, a RAINBOW arc is redundant with any plain arc
+ * (unless that arc is for a pseudocolor).  But we don't try to recognize
+ * that redundancy, either here or in allied operations such as moveins().
+ * The pseudocolor consideration makes that more costly than it seems worth.
  */
 static void
 newarc(struct nfa *nfa,
@@ -1170,6 +1175,9 @@ copyouts(struct nfa *nfa,

 /*
  * cloneouts - copy out arcs of a state to another state pair, modifying type
+ *
+ * This is only used to convert PLAIN arcs to AHEAD/BEHIND arcs, which share
+ * the same interpretation of "co".  It wouldn't be sensible with LACONs.
  */
 static void
 cloneouts(struct nfa *nfa,
@@ -1181,9 +1189,13 @@ cloneouts(struct nfa *nfa,
     struct arc *a;

     assert(old != from);
+    assert(type == AHEAD || type == BEHIND);

     for (a = old->outs; a != NULL; a = a->outchain)
+    {
+        assert(a->type == PLAIN);
         newarc(nfa, type, a->co, from, to);
+    }
 }

 /*
@@ -1597,7 +1609,7 @@ pull(struct nfa *nfa,
     for (a = from->ins; a != NULL && !NISERR(); a = nexta)
     {
         nexta = a->inchain;
-        switch (combine(con, a))
+        switch (combine(nfa, con, a))
         {
             case INCOMPATIBLE:    /* destroy the arc */
                 freearc(nfa, a);
@@ -1624,6 +1636,10 @@ pull(struct nfa *nfa,
                 cparc(nfa, a, s, to);
                 freearc(nfa, a);
                 break;
+            case REPLACEARC:    /* replace arc's color */
+                newarc(nfa, a->type, con->co, a->from, to);
+                freearc(nfa, a);
+                break;
             default:
                 assert(NOTREACHED);
                 break;
@@ -1764,7 +1780,7 @@ push(struct nfa *nfa,
     for (a = to->outs; a != NULL && !NISERR(); a = nexta)
     {
         nexta = a->outchain;
-        switch (combine(con, a))
+        switch (combine(nfa, con, a))
         {
             case INCOMPATIBLE:    /* destroy the arc */
                 freearc(nfa, a);
@@ -1791,6 +1807,10 @@ push(struct nfa *nfa,
                 cparc(nfa, a, from, s);
                 freearc(nfa, a);
                 break;
+            case REPLACEARC:    /* replace arc's color */
+                newarc(nfa, a->type, con->co, from, a->to);
+                freearc(nfa, a);
+                break;
             default:
                 assert(NOTREACHED);
                 break;
@@ -1810,9 +1830,11 @@ push(struct nfa *nfa,
  * #def INCOMPATIBLE    1    // destroys arc
  * #def SATISFIED        2    // constraint satisfied
  * #def COMPATIBLE        3    // compatible but not satisfied yet
+ * #def REPLACEARC        4    // replace arc's color with constraint color
  */
 static int
-combine(struct arc *con,
+combine(struct nfa *nfa,
+        struct arc *con,
         struct arc *a)
 {
 #define  CA(ct,at)     (((ct)<<CHAR_BIT) | (at))
@@ -1827,14 +1849,46 @@ combine(struct arc *con,
         case CA(BEHIND, PLAIN):
             if (con->co == a->co)
                 return SATISFIED;
+            if (con->co == RAINBOW)
+            {
+                /* con is satisfied unless arc's color is a pseudocolor */
+                if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+                    return SATISFIED;
+            }
+            else if (a->co == RAINBOW)
+            {
+                /* con is incompatible if it's for a pseudocolor */
+                if (nfa->cm->cd[con->co].flags & PSEUDO)
+                    return INCOMPATIBLE;
+                /* otherwise, constraint constrains arc to be only its color */
+                return REPLACEARC;
+            }
             return INCOMPATIBLE;
             break;
         case CA('^', '^'):        /* collision, similar constraints */
         case CA('$', '$'):
-        case CA(AHEAD, AHEAD):
+            if (con->co == a->co)    /* true duplication */
+                return SATISFIED;
+            return INCOMPATIBLE;
+            break;
+        case CA(AHEAD, AHEAD):    /* collision, similar constraints */
         case CA(BEHIND, BEHIND):
             if (con->co == a->co)    /* true duplication */
                 return SATISFIED;
+            if (con->co == RAINBOW)
+            {
+                /* con is satisfied unless arc's color is a pseudocolor */
+                if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+                    return SATISFIED;
+            }
+            else if (a->co == RAINBOW)
+            {
+                /* con is incompatible if it's for a pseudocolor */
+                if (nfa->cm->cd[con->co].flags & PSEUDO)
+                    return INCOMPATIBLE;
+                /* otherwise, constraint constrains arc to be only its color */
+                return REPLACEARC;
+            }
             return INCOMPATIBLE;
             break;
         case CA('^', BEHIND):    /* collision, dissimilar constraints */
@@ -2895,6 +2949,7 @@ compact(struct nfa *nfa,
                     break;
                 case LACON:
                     assert(s->no != cnfa->pre);
+                    assert(a->co >= 0);
                     ca->co = (color) (cnfa->ncolors + a->co);
                     ca->to = a->to->no;
                     ca++;
@@ -2902,7 +2957,7 @@ compact(struct nfa *nfa,
                     break;
                 default:
                     NERR(REG_ASSERT);
-                    break;
+                    return;
             }
         carcsort(first, ca - first);
         ca->co = COLORLESS;
@@ -3068,13 +3123,22 @@ dumparc(struct arc *a,
     switch (a->type)
     {
         case PLAIN:
-            fprintf(f, "[%ld]", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, "[*]");
+            else
+                fprintf(f, "[%ld]", (long) a->co);
             break;
         case AHEAD:
-            fprintf(f, ">%ld>", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, ">*>");
+            else
+                fprintf(f, ">%ld>", (long) a->co);
             break;
         case BEHIND:
-            fprintf(f, "<%ld<", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, "<*<");
+            else
+                fprintf(f, "<%ld<", (long) a->co);
             break;
         case LACON:
             fprintf(f, ":%ld:", (long) a->co);
@@ -3161,7 +3225,9 @@ dumpcstate(int st,
     pos = 1;
     for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++)
     {
-        if (ca->co < cnfa->ncolors)
+        if (ca->co == RAINBOW)
+            fprintf(f, "\t[*]->%d", ca->to);
+        else if (ca->co < cnfa->ncolors)
             fprintf(f, "\t[%ld]->%d", (long) ca->co, ca->to);
         else
             fprintf(f, "\t:%ld:->%d", (long) (ca->co - cnfa->ncolors), ca->to);
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 91078dcd80..5956b86026 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -158,7 +158,8 @@ static int    push(struct nfa *, struct arc *, struct state **);
 #define INCOMPATIBLE    1        /* destroys arc */
 #define SATISFIED    2            /* constraint satisfied */
 #define COMPATIBLE    3            /* compatible but not satisfied yet */
-static int    combine(struct arc *, struct arc *);
+#define REPLACEARC    4            /* replace arc's color with constraint color */
+static int    combine(struct nfa *nfa, struct arc *con, struct arc *a);
 static void fixempties(struct nfa *, FILE *);
 static struct state *emptyreachable(struct nfa *, struct state *,
                                     struct state *, struct arc **);
@@ -289,9 +290,11 @@ struct vars
 #define SBEGIN    'A'                /* beginning of string (even if not BOL) */
 #define SEND    'Z'                /* end of string (even if not EOL) */

-/* is an arc colored, and hence on a color chain? */
+/* is an arc colored, and hence should belong to a color chain? */
+/* the test on "co" eliminates RAINBOW arcs, which we don't bother to chain */
 #define COLORED(a) \
-    ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND)
+    ((a)->co >= 0 && \
+     ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND))


 /* static function list */
@@ -1390,7 +1393,8 @@ bracket(struct vars *v,
  * cbracket - handle complemented bracket expression
  * We do it by calling bracket() with dummy endpoints, and then complementing
  * the result.  The alternative would be to invoke rainbow(), and then delete
- * arcs as the b.e. is seen... but that gets messy.
+ * arcs as the b.e. is seen... but that gets messy, and is really quite
+ * infeasible now that rainbow() just puts out one RAINBOW arc.
  */
 static void
 cbracket(struct vars *v,
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 5695e158a5..32be2592c5 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -612,6 +612,7 @@ miss(struct vars *v,
     unsigned    h;
     struct carc *ca;
     struct sset *p;
+    int            ispseudocolor;
     int            ispost;
     int            noprogress;
     int            gotstate;
@@ -643,13 +644,15 @@ miss(struct vars *v,
      */
     for (i = 0; i < d->wordsper; i++)
         d->work[i] = 0;            /* build new stateset bitmap in d->work */
+    ispseudocolor = d->cm->cd[co].flags & PSEUDO;
     ispost = 0;
     noprogress = 1;
     gotstate = 0;
     for (i = 0; i < d->nstates; i++)
         if (ISBSET(css->states, i))
             for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)
-                if (ca->co == co)
+                if (ca->co == co ||
+                    (ca->co == RAINBOW && !ispseudocolor))
                 {
                     BSET(d->work, ca->to);
                     gotstate = 1;
diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c
index d4f940b8c3..a493dbe88c 100644
--- a/src/backend/regex/regexport.c
+++ b/src/backend/regex/regexport.c
@@ -222,7 +222,8 @@ pg_reg_colorisend(const regex_t *regex, int co)
  * Get number of member chrs of color number "co".
  *
  * Note: we return -1 if the color number is invalid, or if it is a special
- * color (WHITE or a pseudocolor), or if the number of members is uncertain.
+ * color (WHITE, RAINBOW, or a pseudocolor), or if the number of members is
+ * uncertain.
  * Callers should not try to extract the members if -1 is returned.
  */
 int
@@ -233,7 +234,7 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
     assert(regex != NULL && regex->re_magic == REMAGIC);
     cm = &((struct guts *) regex->re_guts)->cmap;

-    if (co <= 0 || co > cm->max)    /* we reject 0 which is WHITE */
+    if (co <= 0 || co > cm->max)    /* <= 0 rejects WHITE and RAINBOW */
         return -1;
     if (cm->cd[co].flags & PSEUDO)    /* also pseudocolors (BOS etc) */
         return -1;
@@ -257,7 +258,7 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
  * whose length chars_len must be at least as long as indicated by
  * pg_reg_getnumcharacters(), else not all chars will be returned.
  *
- * Fetching the members of WHITE or a pseudocolor is not supported.
+ * Fetching the members of WHITE, RAINBOW, or a pseudocolor is not supported.
  *
  * Caution: this is a relatively expensive operation.
  */
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index 1d4593ac94..e2fbad7a8a 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -165,9 +165,13 @@ findprefix(struct cnfa *cnfa,
             /* We can ignore BOS/BOL arcs */
             if (ca->co == cnfa->bos[0] || ca->co == cnfa->bos[1])
                 continue;
-            /* ... but EOS/EOL arcs terminate the search, as do LACONs */
+
+            /*
+             * ... but EOS/EOL arcs terminate the search, as do RAINBOW arcs
+             * and LACONs
+             */
             if (ca->co == cnfa->eos[0] || ca->co == cnfa->eos[1] ||
-                ca->co >= cnfa->ncolors)
+                ca->co == RAINBOW || ca->co >= cnfa->ncolors)
             {
                 thiscolor = COLORLESS;
                 break;
diff --git a/src/include/regex/regexport.h b/src/include/regex/regexport.h
index e6209463f7..99c4fb854e 100644
--- a/src/include/regex/regexport.h
+++ b/src/include/regex/regexport.h
@@ -30,6 +30,10 @@

 #include "regex/regex.h"

+/* These macros must match corresponding ones in regguts.h: */
+#define COLOR_WHITE        0        /* color for chars not appearing in regex */
+#define COLOR_RAINBOW    (-2)    /* represents all colors except pseudocolors */
+
 /* information about one arc of a regex's NFA */
 typedef struct
 {
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 5d0e7a961c..5bcd669d59 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -130,11 +130,16 @@
 /*
  * As soon as possible, we map chrs into equivalence classes -- "colors" --
  * which are of much more manageable number.
+ *
+ * To further reduce the number of arcs in NFAs and DFAs, we also have a
+ * special RAINBOW "color" that can be assigned to an arc.  This is not a
+ * real color, in that it has no entry in color maps.
  */
 typedef short color;            /* colors of characters */

 #define MAX_COLOR    32767        /* max color (must fit in 'color' datatype) */
 #define COLORLESS    (-1)        /* impossible color */
+#define RAINBOW        (-2)        /* represents all colors except pseudocolors */
 #define WHITE        0            /* default color, parent of all others */
 /* Note: various places in the code know that WHITE is zero */

@@ -276,7 +281,7 @@ struct state;
 struct arc
 {
     int            type;            /* 0 if free, else an NFA arc type code */
-    color        co;
+    color        co;                /* color the arc matches (possibly RAINBOW) */
     struct state *from;            /* where it's from (and contained within) */
     struct state *to;            /* where it's to */
     struct arc *outchain;        /* link in *from's outs chain or free chain */
@@ -284,6 +289,7 @@ struct arc
 #define  freechain    outchain    /* we do not maintain "freechainRev" */
     struct arc *inchain;        /* link in *to's ins chain */
     struct arc *inchainRev;        /* back-link in *to's ins chain */
+    /* these fields are not used when co == RAINBOW: */
     struct arc *colorchain;        /* link in color's arc chain */
     struct arc *colorchainRev;    /* back-link in color's arc chain */
 };
@@ -344,6 +350,9 @@ struct nfa
  * Plain arcs just store the transition color number as "co".  LACON arcs
  * store the lookaround constraint number plus cnfa.ncolors as "co".  LACON
  * arcs can be distinguished from plain by testing for co >= cnfa.ncolors.
+ *
+ * Note that in a plain arc, "co" can be RAINBOW; since that's negative,
+ * it doesn't break the rule about how to recognize LACON arcs.
  */
 struct carc
 {
diff --git a/src/backend/regex/README b/src/backend/regex/README
index cc1834b89c..a83ab5074d 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -410,14 +410,20 @@ substring, or an imaginary following EOS character if the substring is at
 the end of the input.
 3. If the NFA is (or can be) in the goal state at this point, it matches.

+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any.  The executor implements that by checking if the
+adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
+right color, and it does that in the same loop that checks characters
+within the match.
+
 So one can mentally execute an untransformed NFA by taking ^ and $ as
 ordinary constraints that match at start and end of input; but plain
 arcs out of the start state should be taken as matches for the character
 before the target substring, and similarly, plain arcs leading to the
 post state are matches for the character after the target substring.
-This definition is necessary to support regexes that begin or end with
-constraints such as \m and \M, which imply requirements on the adjacent
-character if any.  NFAs for simple unanchored patterns will usually have
-pre-state outarcs for all possible character colors as well as BOS and
-BOL, and post-state inarcs for all possible character colors as well as
-EOS and EOL, so that the executor's behavior will work.
+After the optimize() transformation, there are explicit arcs mentioning
+BOS/BOL/EOS/EOL adjacent to the pre-state and post-state.  So a finished
+NFA for a pattern without anchors or adjacent-character constraints will
+have pre-state outarcs for RAINBOW (all possible character colors) as well
+as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index 1ac030570d..e5c5c65d84 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -65,6 +65,8 @@ newnfa(struct vars *v,
     nfa->v = v;
     nfa->bos[0] = nfa->bos[1] = COLORLESS;
     nfa->eos[0] = nfa->eos[1] = COLORLESS;
+    nfa->flags = 0;
+    nfa->minmatchall = nfa->maxmatchall = -1;
     nfa->parent = parent;        /* Precedes newfstate so parent is valid. */
     nfa->post = newfstate(nfa, '@');    /* number 0 */
     nfa->pre = newfstate(nfa, '>'); /* number 1 */
@@ -2875,8 +2877,14 @@ analyze(struct nfa *nfa)
     if (NISERR())
         return 0;

+    /* Detect whether NFA can't match anything */
     if (nfa->pre->outs == NULL)
         return REG_UIMPOSSIBLE;
+
+    /* Detect whether NFA matches all strings (possibly with length bounds) */
+    checkmatchall(nfa);
+
+    /* Detect whether NFA can possibly match a zero-length string */
     for (a = nfa->pre->outs; a != NULL; a = a->outchain)
         for (aa = a->to->outs; aa != NULL; aa = aa->outchain)
             if (aa->to == nfa->post)
@@ -2884,6 +2892,279 @@ analyze(struct nfa *nfa)
     return 0;
 }

+/*
+ * checkmatchall - does the NFA represent no more than a string length test?
+ *
+ * If so, set nfa->minmatchall and nfa->maxmatchall correctly (they are -1
+ * to begin with) and set the MATCHALL bit in nfa->flags.
+ *
+ * To succeed, we require all arcs to be PLAIN RAINBOW arcs, except for those
+ * for pseudocolors (i.e., BOS/BOL/EOS/EOL).  We must be able to reach the
+ * post state via RAINBOW arcs, and if there are any loops in the graph, they
+ * must be loop-to-self arcs, ensuring that each loop iteration consumes
+ * exactly one character.  (Longer loops are problematic because they create
+ * non-consecutive possible match lengths; we have no good way to represent
+ * that situation for lengths beyond the DUPINF limit.)
+ *
+ * Pseudocolor arcs complicate things a little.  We know that they can only
+ * appear as pre-state outarcs (for BOS/BOL) or post-state inarcs (for
+ * EOS/EOL).  There, they must exactly replicate the parallel RAINBOW arcs,
+ * e.g. if the pre state has one RAINBOW outarc to state 2, it must have BOS
+ * and BOL outarcs to state 2, and no others.  Missing or extra pseudocolor
+ * arcs can occur, meaning that the NFA involves some constraint on the
+ * adjacent characters, which makes it not a matchall NFA.
+ */
+static void
+checkmatchall(struct nfa *nfa)
+{
+    bool        hasmatch[DUPINF + 1];
+    int            minmatch,
+                maxmatch,
+                morematch;
+
+    /*
+     * hasmatch[i] will be set true if a match of length i is feasible, for i
+     * from 0 to DUPINF-1.  hasmatch[DUPINF] will be set true if every match
+     * length of DUPINF or more is feasible.
+     */
+    memset(hasmatch, 0, sizeof(hasmatch));
+
+    /*
+     * Recursively search the graph for all-RAINBOW paths to the "post" state,
+     * starting at the "pre" state.  The -1 initial depth accounts for the
+     * fact that transitions out of the "pre" state are not part of the
+     * matched string.  We likewise don't count the final transition to the
+     * "post" state as part of the match length.  (But we still insist that
+     * those transitions have RAINBOW arcs, otherwise there are lookbehind or
+     * lookahead constraints at the start/end of the pattern.)
+     */
+    if (!checkmatchall_recurse(nfa, nfa->pre, false, -1, hasmatch))
+        return;
+
+    /*
+     * We found some all-RAINBOW paths, and not anything that we couldn't
+     * handle.  Now verify that pseudocolor arcs adjacent to the pre and post
+     * states match the RAINBOW arcs there.  (We could do this while
+     * recursing, but it's expensive and unlikely to fail, so do it last.)
+     */
+    if (!check_out_colors_match(nfa->pre, RAINBOW, nfa->bos[0]) ||
+        !check_out_colors_match(nfa->pre, nfa->bos[0], RAINBOW) ||
+        !check_out_colors_match(nfa->pre, RAINBOW, nfa->bos[1]) ||
+        !check_out_colors_match(nfa->pre, nfa->bos[1], RAINBOW))
+        return;
+    if (!check_in_colors_match(nfa->post, RAINBOW, nfa->eos[0]) ||
+        !check_in_colors_match(nfa->post, nfa->eos[0], RAINBOW) ||
+        !check_in_colors_match(nfa->post, RAINBOW, nfa->eos[1]) ||
+        !check_in_colors_match(nfa->post, nfa->eos[1], RAINBOW))
+        return;
+
+    /*
+     * hasmatch[] now represents the set of possible match lengths; but we
+     * want to reduce that to a min and max value, because it doesn't seem
+     * worth complicating regexec.c to deal with nonconsecutive possible match
+     * lengths.  Find min and max of first run of lengths, then verify there
+     * are no nonconsecutive lengths.
+     */
+    for (minmatch = 0; minmatch <= DUPINF; minmatch++)
+    {
+        if (hasmatch[minmatch])
+            break;
+    }
+    assert(minmatch <= DUPINF); /* else checkmatchall_recurse lied */
+    for (maxmatch = minmatch; maxmatch < DUPINF; maxmatch++)
+    {
+        if (!hasmatch[maxmatch + 1])
+            break;
+    }
+    for (morematch = maxmatch + 1; morematch <= DUPINF; morematch++)
+    {
+        if (hasmatch[morematch])
+            return;                /* fail, there are nonconsecutive lengths */
+    }
+
+    /* Success, so record the info */
+    nfa->minmatchall = minmatch;
+    nfa->maxmatchall = maxmatch;
+    nfa->flags |= MATCHALL;
+}
+
+/*
+ * checkmatchall_recurse - recursive search for checkmatchall
+ *
+ * s is the current state
+ * foundloop is true if any predecessor state has a loop-to-self
+ * depth is the current recursion depth (starting at -1)
+ * hasmatch[] is the output area for recording feasible match lengths
+ *
+ * We return true if there is at least one all-RAINBOW path to the "post"
+ * state and no non-matchall paths; otherwise false.  Note we assume that
+ * any dead-end paths have already been removed, else we might return
+ * false unnecessarily.
+ */
+static bool
+checkmatchall_recurse(struct nfa *nfa, struct state *s,
+                      bool foundloop, int depth,
+                      bool *hasmatch)
+{
+    bool        result = false;
+    struct arc *a;
+
+    /*
+     * Since this is recursive, it could be driven to stack overflow.  But we
+     * need not treat that as a hard failure; just deem the NFA non-matchall.
+     */
+    if (STACK_TOO_DEEP(nfa->v->re))
+        return false;
+
+    /*
+     * Likewise, if we get to a depth too large to represent correctly in
+     * maxmatchall, fail quietly.
+     */
+    if (depth >= DUPINF)
+        return false;
+
+    /*
+     * Scan the outarcs to detect cases we can't handle, and to see if there
+     * is a loop-to-self here.  We need to know about any such loop before we
+     * recurse, so it's hard to avoid making two passes over the outarcs.  In
+     * any case, checking for showstoppers before we recurse is probably best.
+     */
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->type != PLAIN)
+            return false;        /* any LACONs make it non-matchall */
+        if (a->co != RAINBOW)
+        {
+            if (nfa->cm->cd[a->co].flags & PSEUDO)
+            {
+                /*
+                 * Pseudocolor arc: verify it's in a valid place (this seems
+                 * quite unlikely to fail, but let's be sure).
+                 */
+                if (s == nfa->pre &&
+                    (a->co == nfa->bos[0] || a->co == nfa->bos[1]))
+                     /* okay BOS/BOL arc */ ;
+                else if (a->to == nfa->post &&
+                         (a->co == nfa->eos[0] || a->co == nfa->eos[1]))
+                     /* okay EOS/EOL arc */ ;
+                else
+                    return false;    /* unexpected pseudocolor arc */
+                /* We'll finish checking these arcs after the recursion */
+                continue;
+            }
+            return false;        /* any other color makes it non-matchall */
+        }
+        if (a->to == s)
+        {
+            /*
+             * We found a cycle of length 1, so remember that to pass down to
+             * successor states.  (It doesn't matter if there was also such a
+             * loop at a predecessor state.)
+             */
+            foundloop = true;
+        }
+        else if (a->to->tmp)
+        {
+            /* We found a cycle of length > 1, so fail. */
+            return false;
+        }
+    }
+
+    /* We need to recurse, so mark state as under consideration */
+    assert(s->tmp == NULL);
+    s->tmp = s;
+
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->co != RAINBOW)
+            continue;            /* ignore pseudocolor arcs */
+        if (a->to == nfa->post)
+        {
+            /* We found an all-RAINBOW path to the post state */
+            result = true;
+            /* Record potential match lengths */
+            assert(depth >= 0);
+            hasmatch[depth] = true;
+            if (foundloop)
+            {
+                /* A predecessor loop makes all larger lengths match, too */
+                int            i;
+
+                for (i = depth + 1; i <= DUPINF; i++)
+                    hasmatch[i] = true;
+            }
+        }
+        else if (a->to != s)
+        {
+            /* This is a new path forward; recurse to investigate */
+            result = checkmatchall_recurse(nfa, a->to,
+                                           foundloop, depth + 1,
+                                           hasmatch);
+            /* Fail if any recursive path fails */
+            if (!result)
+                break;
+        }
+    }
+
+    s->tmp = NULL;
+    return result;
+}
+
+/*
+ * check_out_colors_match - subroutine for checkmatchall
+ *
+ * Check if every s outarc of color co1 has a matching outarc of color co2.
+ * (checkmatchall_recurse already verified that all of the outarcs are PLAIN,
+ * so we need not examine arc types here.)
+ */
+static bool
+check_out_colors_match(struct state *s, color co1, color co2)
+{
+    struct arc *a1;
+    struct arc *a2;
+
+    for (a1 = s->outs; a1 != NULL; a1 = a1->outchain)
+    {
+        if (a1->co != co1)
+            continue;
+        for (a2 = s->outs; a2 != NULL; a2 = a2->outchain)
+        {
+            if (a2->co == co2 && a2->to == a1->to)
+                break;
+        }
+        if (a2 == NULL)
+            return false;
+    }
+    return true;
+}
+
+/*
+ * check_in_colors_match - subroutine for checkmatchall
+ *
+ * Check if every s inarc of color co1 has a matching inarc of color co2.
+ * (For paranoia's sake, ignore any non-PLAIN arcs here.)
+ */
+static bool
+check_in_colors_match(struct state *s, color co1, color co2)
+{
+    struct arc *a1;
+    struct arc *a2;
+
+    for (a1 = s->ins; a1 != NULL; a1 = a1->inchain)
+    {
+        if (a1->type != PLAIN || a1->co != co1)
+            continue;
+        for (a2 = s->ins; a2 != NULL; a2 = a2->inchain)
+        {
+            if (a2->type == PLAIN && a2->co == co2 && a2->from == a1->from)
+                break;
+        }
+        if (a2 == NULL)
+            return false;
+    }
+    return true;
+}
+
 /*
  * compact - construct the compact representation of an NFA
  */
@@ -2930,7 +3211,9 @@ compact(struct nfa *nfa,
     cnfa->eos[0] = nfa->eos[0];
     cnfa->eos[1] = nfa->eos[1];
     cnfa->ncolors = maxcolor(nfa->cm) + 1;
-    cnfa->flags = 0;
+    cnfa->flags = nfa->flags;
+    cnfa->minmatchall = nfa->minmatchall;
+    cnfa->maxmatchall = nfa->maxmatchall;

     ca = cnfa->arcs;
     for (s = nfa->states; s != NULL; s = s->next)
@@ -3034,6 +3317,11 @@ dumpnfa(struct nfa *nfa,
         fprintf(f, ", eos [%ld]", (long) nfa->eos[0]);
     if (nfa->eos[1] != COLORLESS)
         fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
+    if (nfa->flags & HASLACONS)
+        fprintf(f, ", haslacons");
+    if (nfa->flags & MATCHALL)
+        fprintf(f, ", minmatchall %d, maxmatchall %d",
+                nfa->minmatchall, nfa->maxmatchall);
     fprintf(f, "\n");
     for (s = nfa->states; s != NULL; s = s->next)
     {
@@ -3201,6 +3489,9 @@ dumpcnfa(struct cnfa *cnfa,
         fprintf(f, ", eol [%ld]", (long) cnfa->eos[1]);
     if (cnfa->flags & HASLACONS)
         fprintf(f, ", haslacons");
+    if (cnfa->flags & MATCHALL)
+        fprintf(f, ", minmatchall %d, maxmatchall %d",
+                cnfa->minmatchall, cnfa->maxmatchall);
     fprintf(f, "\n");
     for (st = 0; st < cnfa->nstates; st++)
         dumpcstate(st, cnfa, f);
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 5956b86026..17e79ad0fb 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -175,6 +175,11 @@ static void cleanup(struct nfa *);
 static void markreachable(struct nfa *, struct state *, struct state *, struct state *);
 static void markcanreach(struct nfa *, struct state *, struct state *, struct state *);
 static long analyze(struct nfa *);
+static void checkmatchall(struct nfa *);
+static bool checkmatchall_recurse(struct nfa *, struct state *,
+                                  bool, int, bool *);
+static bool check_out_colors_match(struct state *, color, color);
+static bool check_in_colors_match(struct state *, color, color);
 static void compact(struct nfa *, struct cnfa *);
 static void carcsort(struct carc *, size_t);
 static int    carc_cmp(const void *, const void *);
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 32be2592c5..20ec463204 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -58,6 +58,29 @@ longest(struct vars *v,
     if (hitstopp != NULL)
         *hitstopp = 0;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = stop - start;
+        size_t        maxmatchall = d->cnfa->maxmatchall;
+
+        if (nchr < d->cnfa->minmatchall)
+            return NULL;
+        if (maxmatchall == DUPINF)
+        {
+            if (stop == v->stop && hitstopp != NULL)
+                *hitstopp = 1;
+        }
+        else
+        {
+            if (stop == v->stop && nchr <= maxmatchall + 1 && hitstopp != NULL)
+                *hitstopp = 1;
+            if (nchr > maxmatchall)
+                return start + maxmatchall;
+        }
+        return stop;
+    }
+
     /* initialize */
     css = initialize(v, d, start);
     if (css == NULL)
@@ -187,6 +210,24 @@ shortest(struct vars *v,
     if (hitstopp != NULL)
         *hitstopp = 0;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = min - start;
+
+        if (d->cnfa->maxmatchall != DUPINF &&
+            nchr > d->cnfa->maxmatchall)
+            return NULL;
+        if ((max - start) < d->cnfa->minmatchall)
+            return NULL;
+        if (nchr < d->cnfa->minmatchall)
+            min = start + d->cnfa->minmatchall;
+        if (coldp != NULL)
+            *coldp = start;
+        /* there is no case where we should set *hitstopp */
+        return min;
+    }
+
     /* initialize */
     css = initialize(v, d, start);
     if (css == NULL)
@@ -312,6 +353,22 @@ matchuntil(struct vars *v,
     struct sset *ss;
     struct colormap *cm = d->cm;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = probe - v->start;
+
+        /*
+         * It might seem that we should check maxmatchall too, but the
+         * implicit .* at the front of the pattern absorbs any extra
+         * characters.  Thus, we should always match as long as there are at
+         * least minmatchall characters.
+         */
+        if (nchr < d->cnfa->minmatchall)
+            return 0;
+        return 1;
+    }
+
     /* initialize and startup, or restart, if necessary */
     if (cp == NULL || cp > probe)
     {
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index e2fbad7a8a..ec435b6f5f 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -77,6 +77,10 @@ pg_regprefix(regex_t *re,
     assert(g->tree != NULL);
     cnfa = &g->tree->cnfa;

+    /* matchall NFAs never have a fixed prefix */
+    if (cnfa->flags & MATCHALL)
+        return REG_NOMATCH;
+
     /*
      * Since a correct NFA should never contain any exit-free loops, it should
      * not be possible for our traversal to return to a previously visited NFA
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 5bcd669d59..956b37b72d 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -331,6 +331,9 @@ struct nfa
     struct colormap *cm;        /* the color map */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
     color        eos[2];            /* colors, if any, assigned to EOS and EOL */
+    int            flags;            /* flags to pass forward to cNFA */
+    int            minmatchall;    /* min number of chrs to match, if matchall */
+    int            maxmatchall;    /* max number of chrs to match, or DUPINF */
     struct vars *v;                /* simplifies compile error reporting */
     struct nfa *parent;            /* parent NFA, if any */
 };
@@ -353,6 +356,14 @@ struct nfa
  *
  * Note that in a plain arc, "co" can be RAINBOW; since that's negative,
  * it doesn't break the rule about how to recognize LACON arcs.
+ *
+ * We have special markings for "trivial" NFAs that can match any string
+ * (possibly with limits on the number of characters therein).  In such a
+ * case, flags & MATCHALL is set (and HASLACONS can't be set).  Then the
+ * fields minmatchall and maxmatchall give the minimum and maximum numbers
+ * of characters to match.  For example, ".*" produces minmatchall = 0
+ * and maxmatchall = DUPINF, while ".+" produces minmatchall = 1 and
+ * maxmatchall = DUPINF.
  */
 struct carc
 {
@@ -366,6 +377,7 @@ struct cnfa
     int            ncolors;        /* number of colors (max color in use + 1) */
     int            flags;
 #define  HASLACONS    01            /* uses lookaround constraints */
+#define  MATCHALL    02            /* matches all strings of a range of lengths */
     int            pre;            /* setup state number */
     int            post;            /* teardown state number */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
@@ -375,6 +387,9 @@ struct cnfa
     struct carc **states;        /* vector of pointers to outarc lists */
     /* states[n] are pointers into a single malloc'd array of arcs */
     struct carc *arcs;            /* the area for the lists */
+    /* these fields are used only in a MATCHALL NFA (else they're -1): */
+    int            minmatchall;    /* min number of chrs to match */
+    int            maxmatchall;    /* max number of chrs to match, or DUPINF */
 };

 #define ZAPCNFA(cnfa)    ((cnfa).nstates = 0)
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 0dc2265d8b..f01ca071d9 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3315,6 +3315,21 @@ select * from test_regex('(?=b)b', 'a', 'HP');
  {0,REG_ULOOKAROUND,REG_UNONPOSIX}
 (1 row)

+-- expectMatch    23.9 HP        ...(?!.)    abcde    cde
+select * from test_regex('...(?!.)', 'abcde', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {cde}
+(2 rows)
+
+-- expectNomatch    23.10 HP    ...(?=.)    abc
+select * from test_regex('...(?=.)', 'abc', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+(1 row)
+
 -- Postgres addition: lookbehind constraints
 -- expectMatch    23.11 HPN        (?<=a)b*    ab    b
 select * from test_regex('(?<=a)b*', 'ab', 'HPN');
@@ -3376,6 +3391,39 @@ select * from test_regex('(?<=b)b', 'b', 'HP');
  {0,REG_ULOOKAROUND,REG_UNONPOSIX}
 (1 row)

+-- expectMatch    23.19 HP        (?<=.)..    abcde    bc
+select * from test_regex('(?<=.)..', 'abcde', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {bc}
+(2 rows)
+
+-- expectMatch    23.20 HP        (?<=..)a*    aaabb    a
+select * from test_regex('(?<=..)a*', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {a}
+(2 rows)
+
+-- expectMatch    23.21 HP        (?<=..)b*    aaabb    {}
+-- Note: empty match here is correct, it matches after the first 2 characters
+select * from test_regex('(?<=..)b*', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {""}
+(2 rows)
+
+-- expectMatch    23.22 HP        (?<=..)b+    aaabb    bb
+select * from test_regex('(?<=..)b+', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {bb}
+(2 rows)
+
 -- doing 24 "non-greedy quantifiers"
 -- expectMatch    24.1  PT    ab+?        abb    ab
 select * from test_regex('ab+?', 'abb', 'PT');
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 1a2bfa6235..ae7d6b43e4 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1049,6 +1049,10 @@ select * from test_regex('a(?!b)b*', 'a', 'HP');
 select * from test_regex('(?=b)b', 'b', 'HP');
 -- expectNomatch    23.8 HP        (?=b)b        a
 select * from test_regex('(?=b)b', 'a', 'HP');
+-- expectMatch    23.9 HP        ...(?!.)    abcde    cde
+select * from test_regex('...(?!.)', 'abcde', 'HP');
+-- expectNomatch    23.10 HP    ...(?=.)    abc
+select * from test_regex('...(?=.)', 'abc', 'HP');

 -- Postgres addition: lookbehind constraints

@@ -1068,6 +1072,15 @@ select * from test_regex('a(?<!b)b*', 'a', 'HP');
 select * from test_regex('(?<=b)b', 'bb', 'HP');
 -- expectNomatch    23.18 HP        (?<=b)b        b
 select * from test_regex('(?<=b)b', 'b', 'HP');
+-- expectMatch    23.19 HP        (?<=.)..    abcde    bc
+select * from test_regex('(?<=.)..', 'abcde', 'HP');
+-- expectMatch    23.20 HP        (?<=..)a*    aaabb    a
+select * from test_regex('(?<=..)a*', 'aaabb', 'HP');
+-- expectMatch    23.21 HP        (?<=..)b*    aaabb    {}
+-- Note: empty match here is correct, it matches after the first 2 characters
+select * from test_regex('(?<=..)b*', 'aaabb', 'HP');
+-- expectMatch    23.22 HP        (?<=..)b+    aaabb    bb
+select * from test_regex('(?<=..)b+', 'aaabb', 'HP');

 -- doing 24 "non-greedy quantifiers"

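The fast path that the patch above adds to longest() can be restated without the engine's pointer arithmetic. The following is only a minimal sketch under stated assumptions: matchall_longest() and NO_LIMIT are hypothetical names, NO_LIMIT stands in for the engine's DUPINF, and the function works on character counts rather than on the chr pointers used in rege_dfa.c. It also ignores the hitstopp bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sentinel for "no upper bound"; stands in for the engine's DUPINF. */
#define NO_LIMIT 256

/*
 * Length of the longest match for a matchall pattern, given nchr available
 * characters, or -1 if no match is possible.
 */
static long
matchall_longest(size_t nchr, int minmatch, int maxmatch)
{
    assert(minmatch >= 0 && minmatch <= maxmatch);
    if (nchr < (size_t) minmatch)
        return -1;              /* too few characters to match at all */
    if (maxmatch != NO_LIMIT && nchr > (size_t) maxmatch)
        return maxmatch;        /* bounded pattern: stop at the upper limit */
    return (long) nchr;         /* otherwise consume everything available */
}
```

With ".*" (minimum 0, no maximum) every input length is returned unchanged, while ".+" (minimum 1) fails only on empty input. A bounded pattern with minimum 2 and maximum 3 would, under the same assumptions, yield 3 for any input of three or more characters.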
Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

February 14, 2021, 12:52:55

On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
> 0001-invent-rainbow-arcs-2.patch
> 0002-recognize-matchall-NFAs-2.patch

I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset.

Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:

This 100577 bytes big regex (pattern_id = 207811)...

...previously raised...

    error invalid regular expression: regular expression is too complex

...but now goes through:

    is_match <NULL> => t
    captured <NULL> => {de}
    error invalid regular expression: regular expression is too complex => <NULL>

Nice. The patched regex engine is apparently capable of handling even more complex regexes than before.

The test that found the deviation tests each (pattern, text string) individually,
to catch errors. But I've also made a separate query to just test regexes
known to not cause errors, to allow testing all regexes in one big query,
which fully utilizes the CPU cores and also runs quicker.

Below is the result of the performance test query:

\timing

SELECT
    tests.is_match IS NOT DISTINCT FROM (subjects.subject ~ patterns.pattern),
    tests.captured IS NOT DISTINCT FROM regexp_match(subjects.subject, patterns.pattern),
    COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
GROUP BY 1,2
ORDER BY 1,2;

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:

 ?column? | ?column? |  count
----------+----------+---------
 t        | t        | 1448212
(1 row)

Time: 592196.145 ms (09:52.196)

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:

 ?column? | ?column? |  count
----------+----------+---------
 t        | t        | 1448212
(1 row)

Time: 461739.364 ms (07:41.739)

That's an impressive 22% speed-up!

/Joel

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

February 14, 2021, 16:45:40

"Joel Jacobson" <joel@compiler.org> writes:
> I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset.
> Out of the 1489489 (pattern, text string) pairs tested,
> there was only one single deviation:
> This 100577 bytes big regex (pattern_id = 207811)...
> ...
> ...previously raised...
>     error invalid regular expression: regular expression is too complex
> ...but now goes through:

> Nice. The patched regex engine is apparently capable of handling even more complex regexes than before.

Yeah.  There are various limitations that can lead to REG_ETOOBIG, but the
main ones are "too many states" and "too many arcs".  The RAINBOW change
directly reduces the number of arcs and thus makes larger regexes feasible.
I'm sure it's coincidental that the one such example you captured happens
to be fixed by this change, but hey I'll take it.

            regards, tom lane

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

February 15, 2021, 03:11:37

"Joel Jacobson" <joel@compiler.org> writes:
> Below is the result of the performance test query:
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
> Time: 592196.145 ms (09:52.196)
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
> Time: 461739.364 ms (07:41.739)
> That's an impressive 22% speed-up!

I've been doing some more hacking over the weekend, and have a couple
of additional improvements to show.  The point of these two additional
patches is to reduce the number of "struct subre" sub-regexps that
the regex parser creates.  The subre's themselves aren't that large,
so this might seem like it would have only small benefit.  However,
each subre requires its own NFA for the portion of the RE that it
matches.  That adds space, and it also adds compilation time because
we run the "optimize()" pass separately for each such NFA.  Maybe
there'd be a way to share some of that work, but I'm not very clear
how.  In any case, not having a subre at all is clearly better where
we can manage it.

0003 is a small patch that fixes up parseqatom() so that it doesn't
emit no-op subre's for empty portions of a regexp branch that are
adjacent to a "messy" regexp atom (that is, a capture node, a
backref, or an atom with greediness different from what preceded it).

0004 is a rather larger patch whose result is to get rid of extra
subre's associated with alternation subre's.  If we have a|b|c
and any of those alternation branches are messy, we end up with

      *
     / \
    a   *
       / \
      b   *
         / \
        c   NULL

where each "*" is an alternation subre node, and all those "*"'s have
identical NFAs that match the whole a|b|c construct.  This means that
for an N-way alternation we're going to need something like O(N^2)
work to optimize all those NFAs.  That's embarrassing (and I think
it's my fault --- if memory serves, I put in this representation
of messy alternations years ago).

We can improve matters by having just one parent node for an
alternation:

    *
     \
      a -> b -> c

That requires replacing the binary-tree structure of subre's
with a child-and-sibling arrangement, which is not terribly
difficult but accounts for most of the bulk of the patch.
(I'd wanted to do that for years, but up till now I did not
think it would have any real material benefit.)
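As a rough illustration of the child-and-sibling arrangement described above, here is a minimal C sketch. struct node and all three helper functions are hypothetical and are not taken from the patch. The two cost functions are only a crude model of the O(N^2) argument: they charge one unit of optimize() work per branch covered by each alternation node's NFA.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type; the real struct subre has many more fields. */
struct node
{
    char         op;        /* '|' for an alternation parent, else a label */
    struct node *child;     /* first child, or NULL */
    struct node *sibling;   /* next child of the same parent, or NULL */
};

/* Walk the sibling chain to count the branches under one parent. */
static int
count_branches(const struct node *parent)
{
    int          n = 0;
    const struct node *c;

    for (c = parent->child; c != NULL; c = c->sibling)
        n++;
    return n;
}

/* Nested form: n alternation nodes, each with an NFA spanning n branches. */
static long
cost_nested(int n)
{
    return (long) n * n;
}

/* Flat form: one alternation node whose single NFA spans all n branches. */
static long
cost_flat(int n)
{
    return n;
}
```

Under this model a 100-way alternation costs 10000 units in the nested form and 100 in the flat form, which is the gap the single-parent representation is meant to close.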

There might be more that can be done in this line, but that's
as far as I got so far.

I did some testing on this using your dataset (thanks for
giving me a copy) and this query:

SELECT
  pattern,
  subject,
  is_match AS is_match_head,
  captured AS captured_head,
  subject ~ pattern AS is_match_patch,
  regexp_match(subject, pattern) AS captured_patch
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern)
     OR captured IS DISTINCT FROM regexp_match(subject, pattern));

I got these runtimes (non-cassert builds):

HEAD     313661.149 ms (05:13.661)
+0001    297397.293 ms (04:57.397)     5% better than HEAD
+0002    151995.803 ms (02:31.996)    51% better than HEAD
+0003    139843.934 ms (02:19.844)    55% better than HEAD
+0004     95034.611 ms (01:35.035)    69% better than HEAD
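
For readers who want to re-derive the percentages in the table above, the arithmetic is just the relative reduction in runtime against HEAD. The helper below is a hypothetical illustration, not part of any patch in this thread.

```c
#include <assert.h>

/* Percentage reduction in runtime relative to a baseline, both in ms. */
static double
improvement_pct(double base_ms, double patched_ms)
{
    assert(base_ms > 0.0);
    return 100.0 * (base_ms - patched_ms) / base_ms;
}
```

Applied to the four patched runtimes with 313661.149 ms as the baseline, this gives roughly 5.2, 51.5, 55.4 and 69.7 percent, which agrees with the table once the values are truncated to whole numbers.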

Since I don't have all the tables used in your query, I can't
try to reproduce your results exactly.  I suspect the reason
I'm getting a better percentage improvement than you did is
that the joining/grouping/ordering involved in your query
creates a higher baseline query cost.

Anyway, I'm feeling pretty pleased with these results ...

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 1e4f0121f3..fcf03de32d 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -282,8 +282,8 @@ typedef struct
 typedef int TrgmColor;

 /* We assume that colors returned by the regexp engine cannot be these: */
-#define COLOR_UNKNOWN    (-1)
-#define COLOR_BLANK        (-2)
+#define COLOR_UNKNOWN    (-3)
+#define COLOR_BLANK        (-4)

 typedef struct
 {
@@ -780,7 +780,8 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
         palloc0(colorsCount * sizeof(TrgmColorInfo));

     /*
-     * Loop over colors, filling TrgmColorInfo about each.
+     * Loop over colors, filling TrgmColorInfo about each.  Note we include
+     * WHITE (0) even though we know it'll be reported as non-expandable.
      */
     for (i = 0; i < colorsCount; i++)
     {
@@ -1098,9 +1099,9 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
             /* Add enter key to this state */
             addKeyToQueue(trgmNFA, &destKey);
         }
-        else
+        else if (arc->co >= 0)
         {
-            /* Regular color */
+            /* Regular color (including WHITE) */
             TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[arc->co];

             if (colorInfo->expandable)
@@ -1156,6 +1157,14 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
                 addKeyToQueue(trgmNFA, &destKey);
             }
         }
+        else
+        {
+            /* RAINBOW: treat as unexpandable color */
+            destKey.prefix.colors[0] = COLOR_UNKNOWN;
+            destKey.prefix.colors[1] = COLOR_UNKNOWN;
+            destKey.nstate = arc->to;
+            addKeyToQueue(trgmNFA, &destKey);
+        }
     }

     pfree(arcs);
@@ -1216,10 +1225,10 @@ addArcs(TrgmNFA *trgmNFA, TrgmState *state)
             /*
              * Ignore non-expandable colors; addKey already handled the case.
              *
-             * We need no special check for begin/end pseudocolors here.  We
-             * don't need to do any processing for them, and they will be
-             * marked non-expandable since the regex engine will have reported
-             * them that way.
+             * We need no special check for WHITE or begin/end pseudocolors
+             * here.  We don't need to do any processing for them, and they
+             * will be marked non-expandable since the regex engine will have
+             * reported them that way.
              */
             if (!colorInfo->expandable)
                 continue;
diff --git a/src/backend/regex/README b/src/backend/regex/README
index f08aab69e3..cc1834b89c 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -261,6 +261,18 @@ and the NFA has these arcs:
     states 4 -> 5 on color 2 ("x" only)
 which can be seen to be a correct representation of the regex.

+There is one more complexity, which is how to handle ".", that is a
+match-anything atom.  We used to do that by generating a "rainbow"
+of arcs of all live colors between the two NFA states before and after
+the dot.  That's expensive in itself when there are lots of colors,
+and it also typically adds lots of follow-on arc-splitting work for the
+color splitting logic.  Now we handle this case by generating a single arc
+labeled with the special color RAINBOW, meaning all colors.  Such arcs
+never need to be split, so they help keep NFAs small in this common case.
+(Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
+not supposed to match newline.  In that case we still handle "." by
+generating an almost-rainbow of all colors except newline's color.)
+
 Given this summary, we can see we need the following operations for
 colors:

@@ -349,6 +361,8 @@ The possible arc types are:

     PLAIN arcs, which specify matching of any character of a given "color"
     (see above).  These are dumped as "[color_number]->to_state".
+    In addition there can be "rainbow" PLAIN arcs, which are dumped as
+    "[*]->to_state".

     EMPTY arcs, which specify a no-op transition to another state.  These
     are dumped as "->to_state".
@@ -356,11 +370,11 @@ The possible arc types are:
     AHEAD constraints, which represent a "next character must be of this
     color" constraint.  AHEAD differs from a PLAIN arc in that the input
     character is not consumed when crossing the arc.  These are dumped as
-    ">color_number>->to_state".
+    ">color_number>->to_state", or possibly ">*>->to_state".

     BEHIND constraints, which represent a "previous character must be of
     this color" constraint, which likewise consumes no input.  These are
-    dumped as "<color_number<->to_state".
+    dumped as "<color_number<->to_state", or possibly "<*<->to_state".

     '^' arcs, which specify a beginning-of-input constraint.  These are
     dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index f5a4151757..0864011cce 100644
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -977,6 +977,7 @@ colorchain(struct colormap *cm,
 {
     struct colordesc *cd = &cm->cd[a->co];

+    assert(a->co >= 0);
     if (cd->arcs != NULL)
         cd->arcs->colorchainRev = a;
     a->colorchain = cd->arcs;
@@ -994,6 +995,7 @@ uncolorchain(struct colormap *cm,
     struct colordesc *cd = &cm->cd[a->co];
     struct arc *aa = a->colorchainRev;

+    assert(a->co >= 0);
     if (aa == NULL)
     {
         assert(cd->arcs == a);
@@ -1012,6 +1014,9 @@ uncolorchain(struct colormap *cm,

 /*
  * rainbow - add arcs of all full colors (but one) between specified states
+ *
+ * If there isn't an exception color, we now generate just a single arc
+ * labeled RAINBOW, saving lots of arc-munging later on.
  */
 static void
 rainbow(struct nfa *nfa,
@@ -1025,6 +1030,13 @@ rainbow(struct nfa *nfa,
     struct colordesc *end = CDEND(cm);
     color        co;

+    if (but == COLORLESS)
+    {
+        newarc(nfa, type, RAINBOW, from, to);
+        return;
+    }
+
+    /* Gotta do it the hard way.  Skip subcolors, pseudocolors, and "but" */
     for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
         if (!UNUSEDCOLOR(cd) && cd->sub != co && co != but &&
             !(cd->flags & PSEUDO))
@@ -1034,13 +1046,16 @@ rainbow(struct nfa *nfa,
 /*
  * colorcomplement - add arcs of complementary colors
  *
+ * We add arcs of all colors that are not pseudocolors and do not match
+ * any of the "of" state's PLAIN outarcs.
+ *
  * The calling sequence ought to be reconciled with cloneouts().
  */
 static void
 colorcomplement(struct nfa *nfa,
                 struct colormap *cm,
                 int type,
-                struct state *of,    /* complements of this guy's PLAIN outarcs */
+                struct state *of,
                 struct state *from,
                 struct state *to)
 {
@@ -1049,6 +1064,11 @@ colorcomplement(struct nfa *nfa,
     color        co;

     assert(of != from);
+
+    /* A RAINBOW arc matches all colors, making the complement empty */
+    if (findarc(of, PLAIN, RAINBOW) != NULL)
+        return;
+
     for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
         if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
             if (findarc(of, PLAIN, co) == NULL)
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index cde82625c8..ff98bfd694 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -271,6 +271,11 @@ destroystate(struct nfa *nfa,
  *
  * This function checks to make sure that no duplicate arcs are created.
  * In general we never want duplicates.
+ *
+ * However: in principle, a RAINBOW arc is redundant with any plain arc
+ * (unless that arc is for a pseudocolor).  But we don't try to recognize
+ * that redundancy, either here or in allied operations such as moveins().
+ * The pseudocolor consideration makes that more costly than it seems worth.
  */
 static void
 newarc(struct nfa *nfa,
@@ -1170,6 +1175,9 @@ copyouts(struct nfa *nfa,

 /*
  * cloneouts - copy out arcs of a state to another state pair, modifying type
+ *
+ * This is only used to convert PLAIN arcs to AHEAD/BEHIND arcs, which share
+ * the same interpretation of "co".  It wouldn't be sensible with LACONs.
  */
 static void
 cloneouts(struct nfa *nfa,
@@ -1181,9 +1189,13 @@ cloneouts(struct nfa *nfa,
     struct arc *a;

     assert(old != from);
+    assert(type == AHEAD || type == BEHIND);

     for (a = old->outs; a != NULL; a = a->outchain)
+    {
+        assert(a->type == PLAIN);
         newarc(nfa, type, a->co, from, to);
+    }
 }

 /*
@@ -1597,7 +1609,7 @@ pull(struct nfa *nfa,
     for (a = from->ins; a != NULL && !NISERR(); a = nexta)
     {
         nexta = a->inchain;
-        switch (combine(con, a))
+        switch (combine(nfa, con, a))
         {
             case INCOMPATIBLE:    /* destroy the arc */
                 freearc(nfa, a);
@@ -1624,6 +1636,10 @@ pull(struct nfa *nfa,
                 cparc(nfa, a, s, to);
                 freearc(nfa, a);
                 break;
+            case REPLACEARC:    /* replace arc's color */
+                newarc(nfa, a->type, con->co, a->from, to);
+                freearc(nfa, a);
+                break;
             default:
                 assert(NOTREACHED);
                 break;
@@ -1764,7 +1780,7 @@ push(struct nfa *nfa,
     for (a = to->outs; a != NULL && !NISERR(); a = nexta)
     {
         nexta = a->outchain;
-        switch (combine(con, a))
+        switch (combine(nfa, con, a))
         {
             case INCOMPATIBLE:    /* destroy the arc */
                 freearc(nfa, a);
@@ -1791,6 +1807,10 @@ push(struct nfa *nfa,
                 cparc(nfa, a, from, s);
                 freearc(nfa, a);
                 break;
+            case REPLACEARC:    /* replace arc's color */
+                newarc(nfa, a->type, con->co, from, a->to);
+                freearc(nfa, a);
+                break;
             default:
                 assert(NOTREACHED);
                 break;
@@ -1810,9 +1830,11 @@ push(struct nfa *nfa,
  * #def INCOMPATIBLE    1    // destroys arc
  * #def SATISFIED        2    // constraint satisfied
  * #def COMPATIBLE        3    // compatible but not satisfied yet
+ * #def REPLACEARC        4    // replace arc's color with constraint color
  */
 static int
-combine(struct arc *con,
+combine(struct nfa *nfa,
+        struct arc *con,
         struct arc *a)
 {
 #define  CA(ct,at)     (((ct)<<CHAR_BIT) | (at))
@@ -1827,14 +1849,46 @@ combine(struct arc *con,
         case CA(BEHIND, PLAIN):
             if (con->co == a->co)
                 return SATISFIED;
+            if (con->co == RAINBOW)
+            {
+                /* con is satisfied unless arc's color is a pseudocolor */
+                if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+                    return SATISFIED;
+            }
+            else if (a->co == RAINBOW)
+            {
+                /* con is incompatible if it's for a pseudocolor */
+                if (nfa->cm->cd[con->co].flags & PSEUDO)
+                    return INCOMPATIBLE;
+                /* otherwise, constraint constrains arc to be only its color */
+                return REPLACEARC;
+            }
             return INCOMPATIBLE;
             break;
         case CA('^', '^'):        /* collision, similar constraints */
         case CA('$', '$'):
-        case CA(AHEAD, AHEAD):
+            if (con->co == a->co)    /* true duplication */
+                return SATISFIED;
+            return INCOMPATIBLE;
+            break;
+        case CA(AHEAD, AHEAD):    /* collision, similar constraints */
         case CA(BEHIND, BEHIND):
             if (con->co == a->co)    /* true duplication */
                 return SATISFIED;
+            if (con->co == RAINBOW)
+            {
+                /* con is satisfied unless arc's color is a pseudocolor */
+                if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+                    return SATISFIED;
+            }
+            else if (a->co == RAINBOW)
+            {
+                /* con is incompatible if it's for a pseudocolor */
+                if (nfa->cm->cd[con->co].flags & PSEUDO)
+                    return INCOMPATIBLE;
+                /* otherwise, constraint constrains arc to be only its color */
+                return REPLACEARC;
+            }
             return INCOMPATIBLE;
             break;
         case CA('^', BEHIND):    /* collision, dissimilar constraints */
@@ -2895,6 +2949,7 @@ compact(struct nfa *nfa,
                     break;
                 case LACON:
                     assert(s->no != cnfa->pre);
+                    assert(a->co >= 0);
                     ca->co = (color) (cnfa->ncolors + a->co);
                     ca->to = a->to->no;
                     ca++;
@@ -2902,7 +2957,7 @@ compact(struct nfa *nfa,
                     break;
                 default:
                     NERR(REG_ASSERT);
-                    break;
+                    return;
             }
         carcsort(first, ca - first);
         ca->co = COLORLESS;
@@ -3068,13 +3123,22 @@ dumparc(struct arc *a,
     switch (a->type)
     {
         case PLAIN:
-            fprintf(f, "[%ld]", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, "[*]");
+            else
+                fprintf(f, "[%ld]", (long) a->co);
             break;
         case AHEAD:
-            fprintf(f, ">%ld>", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, ">*>");
+            else
+                fprintf(f, ">%ld>", (long) a->co);
             break;
         case BEHIND:
-            fprintf(f, "<%ld<", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, "<*<");
+            else
+                fprintf(f, "<%ld<", (long) a->co);
             break;
         case LACON:
             fprintf(f, ":%ld:", (long) a->co);
@@ -3161,7 +3225,9 @@ dumpcstate(int st,
     pos = 1;
     for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++)
     {
-        if (ca->co < cnfa->ncolors)
+        if (ca->co == RAINBOW)
+            fprintf(f, "\t[*]->%d", ca->to);
+        else if (ca->co < cnfa->ncolors)
             fprintf(f, "\t[%ld]->%d", (long) ca->co, ca->to);
         else
             fprintf(f, "\t:%ld:->%d", (long) (ca->co - cnfa->ncolors), ca->to);
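Aside (not part of the patch): the RAINBOW cases added to combine() above amount to a small decision table. What follows is a minimal standalone sketch of that table under simplifying assumptions. The names toy_combine, toy_verdict and the is_pseudo[] array are hypothetical stand-ins for the real struct arc and colormap lookups, and only the two color values are modelled, not the arc types.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RAINBOW (-2)        /* same value the patch gives RAINBOW */

/* hypothetical mirror of the INCOMPATIBLE/SATISFIED/REPLACEARC codes */
enum toy_verdict
{
    TOY_INCOMPATIBLE = 1,
    TOY_SATISFIED = 2,
    TOY_REPLACEARC = 4
};

/* is_pseudo[c] says whether real color c is a pseudocolor (BOS, EOL, ...) */
static enum toy_verdict
toy_combine(const bool *is_pseudo, int ncolors, int con_co, int arc_co)
{
    assert(con_co == TOY_RAINBOW || (con_co >= 0 && con_co < ncolors));
    assert(arc_co == TOY_RAINBOW || (arc_co >= 0 && arc_co < ncolors));

    if (con_co == arc_co)
        return TOY_SATISFIED;   /* true duplication */
    if (con_co == TOY_RAINBOW)  /* constraint accepts any real color */
        return is_pseudo[arc_co] ? TOY_INCOMPATIBLE : TOY_SATISFIED;
    if (arc_co == TOY_RAINBOW)  /* arc gets narrowed to the constraint's color */
        return is_pseudo[con_co] ? TOY_INCOMPATIBLE : TOY_REPLACEARC;
    return TOY_INCOMPATIBLE;    /* two different specific colors */
}
```

The REPLACEARC outcome is the only one that needs a new arc: a RAINBOW arc meeting a constraint on one ordinary color can only ever be taken for that color.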
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index f0896d2db1..e73476040d 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -158,7 +158,8 @@ static int    push(struct nfa *, struct arc *, struct state **);
 #define INCOMPATIBLE    1        /* destroys arc */
 #define SATISFIED    2            /* constraint satisfied */
 #define COMPATIBLE    3            /* compatible but not satisfied yet */
-static int    combine(struct arc *, struct arc *);
+#define REPLACEARC    4            /* replace arc's color with constraint color */
+static int    combine(struct nfa *nfa, struct arc *con, struct arc *a);
 static void fixempties(struct nfa *, FILE *);
 static struct state *emptyreachable(struct nfa *, struct state *,
                                     struct state *, struct arc **);
@@ -289,9 +290,11 @@ struct vars
 #define SBEGIN    'A'                /* beginning of string (even if not BOL) */
 #define SEND    'Z'                /* end of string (even if not EOL) */

-/* is an arc colored, and hence on a color chain? */
+/* is an arc colored, and hence should belong to a color chain? */
+/* the test on "co" eliminates RAINBOW arcs, which we don't bother to chain */
 #define COLORED(a) \
-    ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND)
+    ((a)->co >= 0 && \
+     ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND))


 /* static function list */
@@ -1393,7 +1396,8 @@ bracket(struct vars *v,
  * cbracket - handle complemented bracket expression
  * We do it by calling bracket() with dummy endpoints, and then complementing
  * the result.  The alternative would be to invoke rainbow(), and then delete
- * arcs as the b.e. is seen... but that gets messy.
+ * arcs as the b.e. is seen... but that gets messy, and is really quite
+ * infeasible now that rainbow() just puts out one RAINBOW arc.
  */
 static void
 cbracket(struct vars *v,
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 5695e158a5..32be2592c5 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -612,6 +612,7 @@ miss(struct vars *v,
     unsigned    h;
     struct carc *ca;
     struct sset *p;
+    int            ispseudocolor;
     int            ispost;
     int            noprogress;
     int            gotstate;
@@ -643,13 +644,15 @@ miss(struct vars *v,
      */
     for (i = 0; i < d->wordsper; i++)
         d->work[i] = 0;            /* build new stateset bitmap in d->work */
+    ispseudocolor = d->cm->cd[co].flags & PSEUDO;
     ispost = 0;
     noprogress = 1;
     gotstate = 0;
     for (i = 0; i < d->nstates; i++)
         if (ISBSET(css->states, i))
             for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)
-                if (ca->co == co)
+                if (ca->co == co ||
+                    (ca->co == RAINBOW && !ispseudocolor))
                 {
                     BSET(d->work, ca->to);
                     gotstate = 1;
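Aside (not part of the patch): the test that miss() now applies to each compact arc can be read in isolation. This is a minimal sketch, assuming a toy color-descriptor array. toy_colordesc, TOY_PSEUDO and arc_matches() are hypothetical names, not the real definitions from regguts.h.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RAINBOW (-2)        /* same value the patch gives RAINBOW */
#define TOY_PSEUDO  0x2         /* hypothetical flag bit for this sketch */

struct toy_colordesc
{
    int         flags;
};

/*
 * Does an arc labeled arc_co accept input color co?  A RAINBOW arc accepts
 * every color except pseudocolors such as BOS/BOL/EOS/EOL.
 */
static bool
arc_matches(const struct toy_colordesc *cd, int ncolors, int arc_co, int co)
{
    assert(co >= 0 && co < ncolors);
    if (arc_co == co)
        return true;
    return arc_co == TOY_RAINBOW && !(cd[co].flags & TOY_PSEUDO);
}
```

As in the hunk above, the pseudocolor test depends only on the input color, so it can be computed once per miss() call rather than once per arc.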
diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c
index d4f940b8c3..a493dbe88c 100644
--- a/src/backend/regex/regexport.c
+++ b/src/backend/regex/regexport.c
@@ -222,7 +222,8 @@ pg_reg_colorisend(const regex_t *regex, int co)
  * Get number of member chrs of color number "co".
  *
  * Note: we return -1 if the color number is invalid, or if it is a special
- * color (WHITE or a pseudocolor), or if the number of members is uncertain.
+ * color (WHITE, RAINBOW, or a pseudocolor), or if the number of members is
+ * uncertain.
  * Callers should not try to extract the members if -1 is returned.
  */
 int
@@ -233,7 +234,7 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
     assert(regex != NULL && regex->re_magic == REMAGIC);
     cm = &((struct guts *) regex->re_guts)->cmap;

-    if (co <= 0 || co > cm->max)    /* we reject 0 which is WHITE */
+    if (co <= 0 || co > cm->max)    /* <= 0 rejects WHITE and RAINBOW */
         return -1;
     if (cm->cd[co].flags & PSEUDO)    /* also pseudocolors (BOS etc) */
         return -1;
@@ -257,7 +258,7 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
  * whose length chars_len must be at least as long as indicated by
  * pg_reg_getnumcharacters(), else not all chars will be returned.
  *
- * Fetching the members of WHITE or a pseudocolor is not supported.
+ * Fetching the members of WHITE, RAINBOW, or a pseudocolor is not supported.
  *
  * Caution: this is a relatively expensive operation.
  */
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index 1d4593ac94..e2fbad7a8a 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -165,9 +165,13 @@ findprefix(struct cnfa *cnfa,
             /* We can ignore BOS/BOL arcs */
             if (ca->co == cnfa->bos[0] || ca->co == cnfa->bos[1])
                 continue;
-            /* ... but EOS/EOL arcs terminate the search, as do LACONs */
+
+            /*
+             * ... but EOS/EOL arcs terminate the search, as do RAINBOW arcs
+             * and LACONs
+             */
             if (ca->co == cnfa->eos[0] || ca->co == cnfa->eos[1] ||
-                ca->co >= cnfa->ncolors)
+                ca->co == RAINBOW || ca->co >= cnfa->ncolors)
             {
                 thiscolor = COLORLESS;
                 break;
diff --git a/src/include/regex/regexport.h b/src/include/regex/regexport.h
index e6209463f7..99c4fb854e 100644
--- a/src/include/regex/regexport.h
+++ b/src/include/regex/regexport.h
@@ -30,6 +30,10 @@

 #include "regex/regex.h"

+/* These macros must match corresponding ones in regguts.h: */
+#define COLOR_WHITE        0        /* color for chars not appearing in regex */
+#define COLOR_RAINBOW    (-2)    /* represents all colors except pseudocolors */
+
 /* information about one arc of a regex's NFA */
 typedef struct
 {
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 0a616562d0..6d39108319 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -130,11 +130,16 @@
 /*
  * As soon as possible, we map chrs into equivalence classes -- "colors" --
  * which are of much more manageable number.
+ *
+ * To further reduce the number of arcs in NFAs and DFAs, we also have a
+ * special RAINBOW "color" that can be assigned to an arc.  This is not a
+ * real color, in that it has no entry in color maps.
  */
 typedef short color;            /* colors of characters */

 #define MAX_COLOR    32767        /* max color (must fit in 'color' datatype) */
 #define COLORLESS    (-1)        /* impossible color */
+#define RAINBOW        (-2)        /* represents all colors except pseudocolors */
 #define WHITE        0            /* default color, parent of all others */
 /* Note: various places in the code know that WHITE is zero */

@@ -276,7 +281,7 @@ struct state;
 struct arc
 {
     int            type;            /* 0 if free, else an NFA arc type code */
-    color        co;
+    color        co;                /* color the arc matches (possibly RAINBOW) */
     struct state *from;            /* where it's from (and contained within) */
     struct state *to;            /* where it's to */
     struct arc *outchain;        /* link in *from's outs chain or free chain */
@@ -284,6 +289,7 @@ struct arc
 #define  freechain    outchain    /* we do not maintain "freechainRev" */
     struct arc *inchain;        /* link in *to's ins chain */
     struct arc *inchainRev;        /* back-link in *to's ins chain */
+    /* these fields are not used when co == RAINBOW: */
     struct arc *colorchain;        /* link in color's arc chain */
     struct arc *colorchainRev;    /* back-link in color's arc chain */
 };
@@ -344,6 +350,9 @@ struct nfa
  * Plain arcs just store the transition color number as "co".  LACON arcs
  * store the lookaround constraint number plus cnfa.ncolors as "co".  LACON
  * arcs can be distinguished from plain by testing for co >= cnfa.ncolors.
+ *
+ * Note that in a plain arc, "co" can be RAINBOW; since that's negative,
+ * it doesn't break the rule about how to recognize LACON arcs.
  */
 struct carc
 {
diff --git a/src/backend/regex/README b/src/backend/regex/README
index cc1834b89c..a83ab5074d 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -410,14 +410,20 @@ substring, or an imaginary following EOS character if the substring is at
 the end of the input.
 3. If the NFA is (or can be) in the goal state at this point, it matches.

+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any.  The executor implements that by checking if the
+adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
+right color, and it does that in the same loop that checks characters
+within the match.
+
 So one can mentally execute an untransformed NFA by taking ^ and $ as
 ordinary constraints that match at start and end of input; but plain
 arcs out of the start state should be taken as matches for the character
 before the target substring, and similarly, plain arcs leading to the
 post state are matches for the character after the target substring.
-This definition is necessary to support regexes that begin or end with
-constraints such as \m and \M, which imply requirements on the adjacent
-character if any.  NFAs for simple unanchored patterns will usually have
-pre-state outarcs for all possible character colors as well as BOS and
-BOL, and post-state inarcs for all possible character colors as well as
-EOS and EOL, so that the executor's behavior will work.
+After the optimize() transformation, there are explicit arcs mentioning
+BOS/BOL/EOS/EOL adjacent to the pre-state and post-state.  So a finished
+NFA for a pattern without anchors or adjacent-character constraints will
+have pre-state outarcs for RAINBOW (all possible character colors) as well
+as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index ff98bfd694..b403bb250d 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -65,6 +65,8 @@ newnfa(struct vars *v,
     nfa->v = v;
     nfa->bos[0] = nfa->bos[1] = COLORLESS;
     nfa->eos[0] = nfa->eos[1] = COLORLESS;
+    nfa->flags = 0;
+    nfa->minmatchall = nfa->maxmatchall = -1;
     nfa->parent = parent;        /* Precedes newfstate so parent is valid. */
     nfa->post = newfstate(nfa, '@');    /* number 0 */
     nfa->pre = newfstate(nfa, '>'); /* number 1 */
@@ -2875,8 +2877,14 @@ analyze(struct nfa *nfa)
     if (NISERR())
         return 0;

+    /* Detect whether NFA can't match anything */
     if (nfa->pre->outs == NULL)
         return REG_UIMPOSSIBLE;
+
+    /* Detect whether NFA matches all strings (possibly with length bounds) */
+    checkmatchall(nfa);
+
+    /* Detect whether NFA can possibly match a zero-length string */
     for (a = nfa->pre->outs; a != NULL; a = a->outchain)
         for (aa = a->to->outs; aa != NULL; aa = aa->outchain)
             if (aa->to == nfa->post)
@@ -2884,6 +2892,279 @@ analyze(struct nfa *nfa)
     return 0;
 }

+/*
+ * checkmatchall - does the NFA represent no more than a string length test?
+ *
+ * If so, set nfa->minmatchall and nfa->maxmatchall correctly (they are -1
+ * to begin with) and set the MATCHALL bit in nfa->flags.
+ *
+ * To succeed, we require all arcs to be PLAIN RAINBOW arcs, except for those
+ * for pseudocolors (i.e., BOS/BOL/EOS/EOL).  We must be able to reach the
+ * post state via RAINBOW arcs, and if there are any loops in the graph, they
+ * must be loop-to-self arcs, ensuring that each loop iteration consumes
+ * exactly one character.  (Longer loops are problematic because they create
+ * non-consecutive possible match lengths; we have no good way to represent
+ * that situation for lengths beyond the DUPINF limit.)
+ *
+ * Pseudocolor arcs complicate things a little.  We know that they can only
+ * appear as pre-state outarcs (for BOS/BOL) or post-state inarcs (for
+ * EOS/EOL).  There, they must exactly replicate the parallel RAINBOW arcs,
+ * e.g. if the pre state has one RAINBOW outarc to state 2, it must have BOS
+ * and BOL outarcs to state 2, and no others.  Missing or extra pseudocolor
+ * arcs can occur, meaning that the NFA involves some constraint on the
+ * adjacent characters, which makes it not a matchall NFA.
+ */
+static void
+checkmatchall(struct nfa *nfa)
+{
+    bool        hasmatch[DUPINF + 1];
+    int            minmatch,
+                maxmatch,
+                morematch;
+
+    /*
+     * hasmatch[i] will be set true if a match of length i is feasible, for i
+     * from 0 to DUPINF-1.  hasmatch[DUPINF] will be set true if every match
+     * length of DUPINF or more is feasible.
+     */
+    memset(hasmatch, 0, sizeof(hasmatch));
+
+    /*
+     * Recursively search the graph for all-RAINBOW paths to the "post" state,
+     * starting at the "pre" state.  The -1 initial depth accounts for the
+     * fact that transitions out of the "pre" state are not part of the
+     * matched string.  We likewise don't count the final transition to the
+     * "post" state as part of the match length.  (But we still insist that
+     * those transitions have RAINBOW arcs, otherwise there are lookbehind or
+     * lookahead constraints at the start/end of the pattern.)
+     */
+    if (!checkmatchall_recurse(nfa, nfa->pre, false, -1, hasmatch))
+        return;
+
+    /*
+     * We found some all-RAINBOW paths, and not anything that we couldn't
+     * handle.  Now verify that pseudocolor arcs adjacent to the pre and post
+     * states match the RAINBOW arcs there.  (We could do this while
+     * recursing, but it's expensive and unlikely to fail, so do it last.)
+     */
+    if (!check_out_colors_match(nfa->pre, RAINBOW, nfa->bos[0]) ||
+        !check_out_colors_match(nfa->pre, nfa->bos[0], RAINBOW) ||
+        !check_out_colors_match(nfa->pre, RAINBOW, nfa->bos[1]) ||
+        !check_out_colors_match(nfa->pre, nfa->bos[1], RAINBOW))
+        return;
+    if (!check_in_colors_match(nfa->post, RAINBOW, nfa->eos[0]) ||
+        !check_in_colors_match(nfa->post, nfa->eos[0], RAINBOW) ||
+        !check_in_colors_match(nfa->post, RAINBOW, nfa->eos[1]) ||
+        !check_in_colors_match(nfa->post, nfa->eos[1], RAINBOW))
+        return;
+
+    /*
+     * hasmatch[] now represents the set of possible match lengths; but we
+     * want to reduce that to a min and max value, because it doesn't seem
+     * worth complicating regexec.c to deal with nonconsecutive possible match
+     * lengths.  Find min and max of first run of lengths, then verify there
+     * are no nonconsecutive lengths.
+     */
+    for (minmatch = 0; minmatch <= DUPINF; minmatch++)
+    {
+        if (hasmatch[minmatch])
+            break;
+    }
+    assert(minmatch <= DUPINF); /* else checkmatchall_recurse lied */
+    for (maxmatch = minmatch; maxmatch < DUPINF; maxmatch++)
+    {
+        if (!hasmatch[maxmatch + 1])
+            break;
+    }
+    for (morematch = maxmatch + 1; morematch <= DUPINF; morematch++)
+    {
+        if (hasmatch[morematch])
+            return;                /* fail, there are nonconsecutive lengths */
+    }
+
+    /* Success, so record the info */
+    nfa->minmatchall = minmatch;
+    nfa->maxmatchall = maxmatch;
+    nfa->flags |= MATCHALL;
+}
+
+/*
+ * checkmatchall_recurse - recursive search for checkmatchall
+ *
+ * s is the current state
+ * foundloop is true if any predecessor state has a loop-to-self
+ * depth is the current recursion depth (starting at -1)
+ * hasmatch[] is the output area for recording feasible match lengths
+ *
+ * We return true if there is at least one all-RAINBOW path to the "post"
+ * state and no non-matchall paths; otherwise false.  Note we assume that
+ * any dead-end paths have already been removed, else we might return
+ * false unnecessarily.
+ */
+static bool
+checkmatchall_recurse(struct nfa *nfa, struct state *s,
+                      bool foundloop, int depth,
+                      bool *hasmatch)
+{
+    bool        result = false;
+    struct arc *a;
+
+    /*
+     * Since this is recursive, it could be driven to stack overflow.  But we
+     * need not treat that as a hard failure; just deem the NFA non-matchall.
+     */
+    if (STACK_TOO_DEEP(nfa->v->re))
+        return false;
+
+    /*
+     * Likewise, if we get to a depth too large to represent correctly in
+     * maxmatchall, fail quietly.
+     */
+    if (depth >= DUPINF)
+        return false;
+
+    /*
+     * Scan the outarcs to detect cases we can't handle, and to see if there
+     * is a loop-to-self here.  We need to know about any such loop before we
+     * recurse, so it's hard to avoid making two passes over the outarcs.  In
+     * any case, checking for showstoppers before we recurse is probably best.
+     */
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->type != PLAIN)
+            return false;        /* any LACONs make it non-matchall */
+        if (a->co != RAINBOW)
+        {
+            if (nfa->cm->cd[a->co].flags & PSEUDO)
+            {
+                /*
+                 * Pseudocolor arc: verify it's in a valid place (this seems
+                 * quite unlikely to fail, but let's be sure).
+                 */
+                if (s == nfa->pre &&
+                    (a->co == nfa->bos[0] || a->co == nfa->bos[1]))
+                     /* okay BOS/BOL arc */ ;
+                else if (a->to == nfa->post &&
+                         (a->co == nfa->eos[0] || a->co == nfa->eos[1]))
+                     /* okay EOS/EOL arc */ ;
+                else
+                    return false;    /* unexpected pseudocolor arc */
+                /* We'll finish checking these arcs after the recursion */
+                continue;
+            }
+            return false;        /* any other color makes it non-matchall */
+        }
+        if (a->to == s)
+        {
+            /*
+             * We found a cycle of length 1, so remember that to pass down to
+             * successor states.  (It doesn't matter if there was also such a
+             * loop at a predecessor state.)
+             */
+            foundloop = true;
+        }
+        else if (a->to->tmp)
+        {
+            /* We found a cycle of length > 1, so fail. */
+            return false;
+        }
+    }
+
+    /* We need to recurse, so mark state as under consideration */
+    assert(s->tmp == NULL);
+    s->tmp = s;
+
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->co != RAINBOW)
+            continue;            /* ignore pseudocolor arcs */
+        if (a->to == nfa->post)
+        {
+            /* We found an all-RAINBOW path to the post state */
+            result = true;
+            /* Record potential match lengths */
+            assert(depth >= 0);
+            hasmatch[depth] = true;
+            if (foundloop)
+            {
+                /* A predecessor loop makes all larger lengths match, too */
+                int            i;
+
+                for (i = depth + 1; i <= DUPINF; i++)
+                    hasmatch[i] = true;
+            }
+        }
+        else if (a->to != s)
+        {
+            /* This is a new path forward; recurse to investigate */
+            result = checkmatchall_recurse(nfa, a->to,
+                                           foundloop, depth + 1,
+                                           hasmatch);
+            /* Fail if any recursive path fails */
+            if (!result)
+                break;
+        }
+    }
+
+    s->tmp = NULL;
+    return result;
+}
+
+/*
+ * check_out_colors_match - subroutine for checkmatchall
+ *
+ * Check if every s outarc of color co1 has a matching outarc of color co2.
+ * (checkmatchall_recurse already verified that all of the outarcs are PLAIN,
+ * so we need not examine arc types here.)
+ */
+static bool
+check_out_colors_match(struct state *s, color co1, color co2)
+{
+    struct arc *a1;
+    struct arc *a2;
+
+    for (a1 = s->outs; a1 != NULL; a1 = a1->outchain)
+    {
+        if (a1->co != co1)
+            continue;
+        for (a2 = s->outs; a2 != NULL; a2 = a2->outchain)
+        {
+            if (a2->co == co2 && a2->to == a1->to)
+                break;
+        }
+        if (a2 == NULL)
+            return false;
+    }
+    return true;
+}
+
+/*
+ * check_in_colors_match - subroutine for checkmatchall
+ *
+ * Check if every s inarc of color co1 has a matching inarc of color co2.
+ * (For paranoia's sake, ignore any non-PLAIN arcs here.)
+ */
+static bool
+check_in_colors_match(struct state *s, color co1, color co2)
+{
+    struct arc *a1;
+    struct arc *a2;
+
+    for (a1 = s->ins; a1 != NULL; a1 = a1->inchain)
+    {
+        if (a1->type != PLAIN || a1->co != co1)
+            continue;
+        for (a2 = s->ins; a2 != NULL; a2 = a2->inchain)
+        {
+            if (a2->type == PLAIN && a2->co == co2 && a2->from == a1->from)
+                break;
+        }
+        if (a2 == NULL)
+            return false;
+    }
+    return true;
+}
+
 /*
  * compact - construct the compact representation of an NFA
  */
@@ -2930,7 +3211,9 @@ compact(struct nfa *nfa,
     cnfa->eos[0] = nfa->eos[0];
     cnfa->eos[1] = nfa->eos[1];
     cnfa->ncolors = maxcolor(nfa->cm) + 1;
-    cnfa->flags = 0;
+    cnfa->flags = nfa->flags;
+    cnfa->minmatchall = nfa->minmatchall;
+    cnfa->maxmatchall = nfa->maxmatchall;

     ca = cnfa->arcs;
     for (s = nfa->states; s != NULL; s = s->next)
@@ -3034,6 +3317,11 @@ dumpnfa(struct nfa *nfa,
         fprintf(f, ", eos [%ld]", (long) nfa->eos[0]);
     if (nfa->eos[1] != COLORLESS)
         fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
+    if (nfa->flags & HASLACONS)
+        fprintf(f, ", haslacons");
+    if (nfa->flags & MATCHALL)
+        fprintf(f, ", minmatchall %d, maxmatchall %d",
+                nfa->minmatchall, nfa->maxmatchall);
     fprintf(f, "\n");
     for (s = nfa->states; s != NULL; s = s->next)
     {
@@ -3201,6 +3489,9 @@ dumpcnfa(struct cnfa *cnfa,
         fprintf(f, ", eol [%ld]", (long) cnfa->eos[1]);
     if (cnfa->flags & HASLACONS)
         fprintf(f, ", haslacons");
+    if (cnfa->flags & MATCHALL)
+        fprintf(f, ", minmatchall %d, maxmatchall %d",
+                cnfa->minmatchall, cnfa->maxmatchall);
     fprintf(f, "\n");
     for (st = 0; st < cnfa->nstates; st++)
         dumpcstate(st, cnfa, f);
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index e73476040d..b228aedbd9 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -175,6 +175,11 @@ static void cleanup(struct nfa *);
 static void markreachable(struct nfa *, struct state *, struct state *, struct state *);
 static void markcanreach(struct nfa *, struct state *, struct state *, struct state *);
 static long analyze(struct nfa *);
+static void checkmatchall(struct nfa *);
+static bool checkmatchall_recurse(struct nfa *, struct state *,
+                                  bool, int, bool *);
+static bool check_out_colors_match(struct state *, color, color);
+static bool check_in_colors_match(struct state *, color, color);
 static void compact(struct nfa *, struct cnfa *);
 static void carcsort(struct carc *, size_t);
 static int    carc_cmp(const void *, const void *);
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 32be2592c5..20ec463204 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -58,6 +58,29 @@ longest(struct vars *v,
     if (hitstopp != NULL)
         *hitstopp = 0;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = stop - start;
+        size_t        maxmatchall = d->cnfa->maxmatchall;
+
+        if (nchr < d->cnfa->minmatchall)
+            return NULL;
+        if (maxmatchall == DUPINF)
+        {
+            if (stop == v->stop && hitstopp != NULL)
+                *hitstopp = 1;
+        }
+        else
+        {
+            if (stop == v->stop && nchr <= maxmatchall + 1 && hitstopp != NULL)
+                *hitstopp = 1;
+            if (nchr > maxmatchall)
+                return start + maxmatchall;
+        }
+        return stop;
+    }
+
     /* initialize */
     css = initialize(v, d, start);
     if (css == NULL)
@@ -187,6 +210,24 @@ shortest(struct vars *v,
     if (hitstopp != NULL)
         *hitstopp = 0;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = min - start;
+
+        if (d->cnfa->maxmatchall != DUPINF &&
+            nchr > d->cnfa->maxmatchall)
+            return NULL;
+        if ((max - start) < d->cnfa->minmatchall)
+            return NULL;
+        if (nchr < d->cnfa->minmatchall)
+            min = start + d->cnfa->minmatchall;
+        if (coldp != NULL)
+            *coldp = start;
+        /* there is no case where we should set *hitstopp */
+        return min;
+    }
+
     /* initialize */
     css = initialize(v, d, start);
     if (css == NULL)
@@ -312,6 +353,22 @@ matchuntil(struct vars *v,
     struct sset *ss;
     struct colormap *cm = d->cm;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = probe - v->start;
+
+        /*
+         * It might seem that we should check maxmatchall too, but the
+         * implicit .* at the front of the pattern absorbs any extra
+         * characters.  Thus, we should always match as long as there are at
+         * least minmatchall characters.
+         */
+        if (nchr < d->cnfa->minmatchall)
+            return 0;
+        return 1;
+    }
+
     /* initialize and startup, or restart, if necessary */
     if (cp == NULL || cp > probe)
     {
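
Reader's note on the rege_dfa.c hunks above: the matchall fast paths in longest() and shortest() reduce to length arithmetic on the substring being probed. The following is a minimal standalone sketch of that arithmetic, not the patch's code. It works on offsets relative to "start" instead of chr pointers, returns -1 where the patch returns NULL, and uses the hypothetical names SKETCH_INF (a stand-in for DUPINF), matchall_longest and matchall_shortest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for DUPINF ("no upper limit"); value is arbitrary. */
#define SKETCH_INF ((size_t) 256)

/*
 * Model of the longest() fast path.  nchr is the number of characters
 * available (stop - start); at_string_end says whether stop == v->stop.
 * Returns the match length, or -1 for no match.
 */
long
matchall_longest(size_t nchr, size_t minmatchall, size_t maxmatchall,
                 int at_string_end, int *hitstop)
{
    assert(minmatchall <= maxmatchall);
    if (hitstop != NULL)
        *hitstop = 0;
    if (nchr < minmatchall)
        return -1;              /* too short to match at all */
    if (maxmatchall == SKETCH_INF)
    {
        /* unbounded: we consume everything, so we may have hit the end */
        if (at_string_end && hitstop != NULL)
            *hitstop = 1;
        return (long) nchr;
    }
    /* bounded: same "+ 1" slack as in the hunk above */
    if (at_string_end && nchr <= maxmatchall + 1 && hitstop != NULL)
        *hitstop = 1;
    if (nchr > maxmatchall)
        return (long) maxmatchall;  /* clamp to the longest legal match */
    return (long) nchr;
}

/*
 * Model of the shortest() fast path.  min_off and max_off are the
 * offsets of "min" and "max" relative to start.
 * Returns the match length, or -1 for no match.
 */
long
matchall_shortest(size_t min_off, size_t max_off,
                  size_t minmatchall, size_t maxmatchall)
{
    if (maxmatchall != SKETCH_INF && min_off > maxmatchall)
        return -1;              /* earliest allowed end is already too far */
    if (max_off < minmatchall)
        return -1;              /* latest allowed end is still too near */
    if (min_off < minmatchall)
        min_off = minmatchall;  /* must consume at least minmatchall chrs */
    return (long) min_off;
}
```

For instance, under these assumptions a ".+"-like NFA (minmatchall = 1, no upper limit) probed over five characters up to the end of the string yields a length of 5 with the hitstop flag set, while a bounded NFA with limits 1..3 over the same span yields 3 with the flag clear.
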
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index e2fbad7a8a..ec435b6f5f 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -77,6 +77,10 @@ pg_regprefix(regex_t *re,
     assert(g->tree != NULL);
     cnfa = &g->tree->cnfa;

+    /* matchall NFAs never have a fixed prefix */
+    if (cnfa->flags & MATCHALL)
+        return REG_NOMATCH;
+
     /*
      * Since a correct NFA should never contain any exit-free loops, it should
      * not be possible for our traversal to return to a previously visited NFA
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 6d39108319..82e761bfe5 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -331,6 +331,9 @@ struct nfa
     struct colormap *cm;        /* the color map */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
     color        eos[2];            /* colors, if any, assigned to EOS and EOL */
+    int            flags;            /* flags to pass forward to cNFA */
+    int            minmatchall;    /* min number of chrs to match, if matchall */
+    int            maxmatchall;    /* max number of chrs to match, or DUPINF */
     struct vars *v;                /* simplifies compile error reporting */
     struct nfa *parent;            /* parent NFA, if any */
 };
@@ -353,6 +356,14 @@ struct nfa
  *
  * Note that in a plain arc, "co" can be RAINBOW; since that's negative,
  * it doesn't break the rule about how to recognize LACON arcs.
+ *
+ * We have special markings for "trivial" NFAs that can match any string
+ * (possibly with limits on the number of characters therein).  In such a
+ * case, flags & MATCHALL is set (and HASLACONS can't be set).  Then the
+ * fields minmatchall and maxmatchall give the minimum and maximum numbers
+ * of characters to match.  For example, ".*" produces minmatchall = 0
+ * and maxmatchall = DUPINF, while ".+" produces minmatchall = 1 and
+ * maxmatchall = DUPINF.
  */
 struct carc
 {
@@ -366,6 +377,7 @@ struct cnfa
     int            ncolors;        /* number of colors (max color in use + 1) */
     int            flags;
 #define  HASLACONS    01            /* uses lookaround constraints */
+#define  MATCHALL    02            /* matches all strings of a range of lengths */
     int            pre;            /* setup state number */
     int            post;            /* teardown state number */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
@@ -375,6 +387,9 @@ struct cnfa
     struct carc **states;        /* vector of pointers to outarc lists */
     /* states[n] are pointers into a single malloc'd array of arcs */
     struct carc *arcs;            /* the area for the lists */
+    /* these fields are used only in a MATCHALL NFA (else they're -1): */
+    int            minmatchall;    /* min number of chrs to match */
+    int            maxmatchall;    /* max number of chrs to match, or DUPINF */
 };

 /*
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 0dc2265d8b..f01ca071d9 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3315,6 +3315,21 @@ select * from test_regex('(?=b)b', 'a', 'HP');
  {0,REG_ULOOKAROUND,REG_UNONPOSIX}
 (1 row)

+-- expectMatch    23.9 HP        ...(?!.)    abcde    cde
+select * from test_regex('...(?!.)', 'abcde', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {cde}
+(2 rows)
+
+-- expectNomatch    23.10 HP    ...(?=.)    abc
+select * from test_regex('...(?=.)', 'abc', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+(1 row)
+
 -- Postgres addition: lookbehind constraints
 -- expectMatch    23.11 HPN        (?<=a)b*    ab    b
 select * from test_regex('(?<=a)b*', 'ab', 'HPN');
@@ -3376,6 +3391,39 @@ select * from test_regex('(?<=b)b', 'b', 'HP');
  {0,REG_ULOOKAROUND,REG_UNONPOSIX}
 (1 row)

+-- expectMatch    23.19 HP        (?<=.)..    abcde    bc
+select * from test_regex('(?<=.)..', 'abcde', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {bc}
+(2 rows)
+
+-- expectMatch    23.20 HP        (?<=..)a*    aaabb    a
+select * from test_regex('(?<=..)a*', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {a}
+(2 rows)
+
+-- expectMatch    23.21 HP        (?<=..)b*    aaabb    {}
+-- Note: empty match here is correct, it matches after the first 2 characters
+select * from test_regex('(?<=..)b*', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {""}
+(2 rows)
+
+-- expectMatch    23.22 HP        (?<=..)b+    aaabb    bb
+select * from test_regex('(?<=..)b+', 'aaabb', 'HP');
+            test_regex
+-----------------------------------
+ {0,REG_ULOOKAROUND,REG_UNONPOSIX}
+ {bb}
+(2 rows)
+
 -- doing 24 "non-greedy quantifiers"
 -- expectMatch    24.1  PT    ab+?        abb    ab
 select * from test_regex('ab+?', 'abb', 'PT');
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 1a2bfa6235..ae7d6b43e4 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1049,6 +1049,10 @@ select * from test_regex('a(?!b)b*', 'a', 'HP');
 select * from test_regex('(?=b)b', 'b', 'HP');
 -- expectNomatch    23.8 HP        (?=b)b        a
 select * from test_regex('(?=b)b', 'a', 'HP');
+-- expectMatch    23.9 HP        ...(?!.)    abcde    cde
+select * from test_regex('...(?!.)', 'abcde', 'HP');
+-- expectNomatch    23.10 HP    ...(?=.)    abc
+select * from test_regex('...(?=.)', 'abc', 'HP');

 -- Postgres addition: lookbehind constraints

@@ -1068,6 +1072,15 @@ select * from test_regex('a(?<!b)b*', 'a', 'HP');
 select * from test_regex('(?<=b)b', 'bb', 'HP');
 -- expectNomatch    23.18 HP        (?<=b)b        b
 select * from test_regex('(?<=b)b', 'b', 'HP');
+-- expectMatch    23.19 HP        (?<=.)..    abcde    bc
+select * from test_regex('(?<=.)..', 'abcde', 'HP');
+-- expectMatch    23.20 HP        (?<=..)a*    aaabb    a
+select * from test_regex('(?<=..)a*', 'aaabb', 'HP');
+-- expectMatch    23.21 HP        (?<=..)b*    aaabb    {}
+-- Note: empty match here is correct, it matches after the first 2 characters
+select * from test_regex('(?<=..)b*', 'aaabb', 'HP');
+-- expectMatch    23.22 HP        (?<=..)b+    aaabb    bb
+select * from test_regex('(?<=..)b+', 'aaabb', 'HP');

 -- doing 24 "non-greedy quantifiers"

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index b228aedbd9..6cf4209d30 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -1123,14 +1123,24 @@ parseqatom(struct vars *v,
     t->left = atom;
     atomp = &t->left;
 
-    /* here we should recurse... but we must postpone that to the end */
+    /*
+     * Here we should recurse to fill t->right ... but we must postpone that
+     * to the end.
+     */
 
-    /* split top into prefix and remaining */
+    /*
+     * Convert top node to a concatenation of the prefix (top->left, covering
+     * whatever we parsed previously) and remaining (t).  Note that the prefix
+     * could be empty, in which case this concatenation node is unnecessary.
+     * To keep things simple, we operate in a general way for now, and get rid
+     * of unnecessary subres below.
+     */
     assert(top->op == '=' && top->left == NULL && top->right == NULL);
     top->left = subre(v, '=', top->flags, top->begin, lp);
     NOERR();
     top->op = '.';
     top->right = t;
+    /* top->flags will get updated later */
 
     /* if it's a backref, now is the time to replicate the subNFA */
     if (atomtype == BACKREF)
@@ -1220,16 +1230,75 @@ parseqatom(struct vars *v,
     /* and finally, look after that postponed recursion */
     t = top->right;
     if (!(SEE('|') || SEE(stopper) || SEE(EOS)))
+    {
+        /* parse all the rest of the branch, and insert in t->right */
         t->right = parsebranch(v, stopper, type, s2, rp, 1);
+        NOERR();
+        assert(SEE('|') || SEE(stopper) || SEE(EOS));
+
+        /* here's the promised update of the flags */
+        t->flags |= COMBINE(t->flags, t->right->flags);
+        top->flags |= COMBINE(top->flags, t->flags);
+
+        /*
+         * At this point both top and t are concatenation (op == '.') subres,
+         * and we have top->left = prefix of branch, top->right = t, t->left =
+         * messy atom (with quantification superstructure if needed), t->right
+         * = rest of branch.
+         *
+         * If the messy atom was the first thing in the branch, then top->left
+         * is vacuous and we can get rid of one level of concatenation.  Since
+         * the caller is holding a pointer to the top node, we can't remove
+         * that node; but we're allowed to change its properties.
+         */
+        assert(top->left->op == '=');
+        if (top->left->begin == top->left->end)
+        {
+            assert(!MESSY(top->left->flags));
+            freesubre(v, top->left);
+            top->left = t->left;
+            top->right = t->right;
+            t->left = t->right = NULL;
+            freesubre(v, t);
+        }
+    }
     else
     {
+        /*
+         * There's nothing left in the branch, so we don't need the second
+         * concatenation node 't'.  Just link s2 straight to rp.
+         */
         EMPTYARC(s2, rp);
-        t->right = subre(v, '=', 0, s2, rp);
+        top->right = t->left;
+        top->flags |= COMBINE(top->flags, top->right->flags);
+        t->left = t->right = NULL;
+        freesubre(v, t);
+
+        /*
+         * Again, it could be that top->left is vacuous (if the messy atom was
+         * in fact the only thing in the branch).  In that case we need no
+         * concatenation at all; just replace top with top->right.
+         */
+        assert(top->left->op == '=');
+        if (top->left->begin == top->left->end)
+        {
+            assert(!MESSY(top->left->flags));
+            freesubre(v, top->left);
+            t = top->right;
+            top->op = t->op;
+            top->flags = t->flags;
+            top->id = t->id;
+            top->subno = t->subno;
+            top->min = t->min;
+            top->max = t->max;
+            top->left = t->left;
+            top->right = t->right;
+            top->begin = t->begin;
+            top->end = t->end;
+            t->left = t->right = NULL;
+            freesubre(v, t);
+        }
     }
-    NOERR();
-    assert(SEE('|') || SEE(stopper) || SEE(EOS));
-    t->flags |= COMBINE(t->flags, t->right->flags);
-    top->flags |= COMBINE(top->flags, t->flags);
 }
 
 /*
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 6cf4209d30..c688806992 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -58,6 +58,7 @@ static void processlacon(struct vars *, struct state *, struct state *, int,
                          struct state *, struct state *);
 static struct subre *subre(struct vars *, int, int, struct state *, struct state *);
 static void freesubre(struct vars *, struct subre *);
+static void freesubreandsiblings(struct vars *, struct subre *);
 static void freesrnode(struct vars *, struct subre *);
 static void optst(struct vars *, struct subre *);
 static int    numst(struct subre *, int);
@@ -652,8 +653,8 @@ makesearch(struct vars *v,
  * parse - parse an RE
  *
  * This is actually just the top level, which parses a bunch of branches
- * tied together with '|'.  They appear in the tree as the left children
- * of a chain of '|' subres.
+ * tied together with '|'.  If there's more than one, they appear in the
+ * tree as the children of a '|' subre.
  */
 static struct subre *
 parse(struct vars *v,
@@ -662,41 +663,34 @@ parse(struct vars *v,
       struct state *init,        /* initial state */
       struct state *final)        /* final state */
 {
-    struct state *left;            /* scaffolding for branch */
-    struct state *right;
     struct subre *branches;        /* top level */
-    struct subre *branch;        /* current branch */
-    struct subre *t;            /* temporary */
-    int            firstbranch;    /* is this the first branch? */
+    struct subre *lastbranch;    /* latest branch */

     assert(stopper == ')' || stopper == EOS);

     branches = subre(v, '|', LONGER, init, final);
     NOERRN();
-    branch = branches;
-    firstbranch = 1;
+    lastbranch = NULL;
     do
     {                            /* a branch */
-        if (!firstbranch)
-        {
-            /* need a place to hang it */
-            branch->right = subre(v, '|', LONGER, init, final);
-            NOERRN();
-            branch = branch->right;
-        }
-        firstbranch = 0;
+        struct subre *branch;
+        struct state *left;        /* scaffolding for branch */
+        struct state *right;
+
         left = newstate(v->nfa);
         right = newstate(v->nfa);
         NOERRN();
         EMPTYARC(init, left);
         EMPTYARC(right, final);
         NOERRN();
-        branch->left = parsebranch(v, stopper, type, left, right, 0);
+        branch = parsebranch(v, stopper, type, left, right, 0);
         NOERRN();
-        branch->flags |= UP(branch->flags | branch->left->flags);
-        if ((branch->flags & ~branches->flags) != 0)    /* new flags */
-            for (t = branches; t != branch; t = t->right)
-                t->flags |= branch->flags;
+        if (lastbranch)
+            lastbranch->sibling = branch;
+        else
+            branches->child = branch;
+        branches->flags |= UP(branches->flags | branch->flags);
+        lastbranch = branch;
     } while (EAT('|'));
     assert(SEE(stopper) || SEE(EOS));

@@ -707,20 +701,16 @@ parse(struct vars *v,
     }

     /* optimize out simple cases */
-    if (branch == branches)
+    if (lastbranch == branches->child)
     {                            /* only one branch */
-        assert(branch->right == NULL);
-        t = branch->left;
-        branch->left = NULL;
-        freesubre(v, branches);
-        branches = t;
+        assert(lastbranch->sibling == NULL);
+        freesrnode(v, branches);
+        branches = lastbranch;
     }
     else if (!MESSY(branches->flags))
     {                            /* no interesting innards */
-        freesubre(v, branches->left);
-        branches->left = NULL;
-        freesubre(v, branches->right);
-        branches->right = NULL;
+        freesubreandsiblings(v, branches->child);
+        branches->child = NULL;
         branches->op = '=';
     }

@@ -972,7 +962,7 @@ parseqatom(struct vars *v,
                 t = subre(v, '(', atom->flags | CAP, lp, rp);
                 NOERR();
                 t->subno = subno;
-                t->left = atom;
+                t->child = atom;
                 atom = t;
             }
             /* postpone everything else pending possible {0} */
@@ -1120,26 +1110,27 @@ parseqatom(struct vars *v,
     /* break remaining subRE into x{...} and what follows */
     t = subre(v, '.', COMBINE(qprefer, atom->flags), lp, rp);
     NOERR();
-    t->left = atom;
-    atomp = &t->left;
+    t->child = atom;
+    atomp = &t->child;

     /*
-     * Here we should recurse to fill t->right ... but we must postpone that
-     * to the end.
+     * Here we should recurse to fill t->child->sibling ... but we must
+     * postpone that to the end.  One reason is that t->child may be replaced
+     * below, and we don't want to worry about its sibling link.
      */

     /*
-     * Convert top node to a concatenation of the prefix (top->left, covering
+     * Convert top node to a concatenation of the prefix (top->child, covering
      * whatever we parsed previously) and remaining (t).  Note that the prefix
      * could be empty, in which case this concatenation node is unnecessary.
      * To keep things simple, we operate in a general way for now, and get rid
      * of unnecessary subres below.
      */
-    assert(top->op == '=' && top->left == NULL && top->right == NULL);
-    top->left = subre(v, '=', top->flags, top->begin, lp);
+    assert(top->op == '=' && top->child == NULL);
+    top->child = subre(v, '=', top->flags, top->begin, lp);
     NOERR();
     top->op = '.';
-    top->right = t;
+    top->child->sibling = t;
     /* top->flags will get updated later */

     /* if it's a backref, now is the time to replicate the subNFA */
@@ -1201,9 +1192,9 @@ parseqatom(struct vars *v,
         f = COMBINE(qprefer, atom->flags);
         t = subre(v, '.', f, s, atom->end); /* prefix and atom */
         NOERR();
-        t->left = subre(v, '=', PREF(f), s, atom->begin);
+        t->child = subre(v, '=', PREF(f), s, atom->begin);
         NOERR();
-        t->right = atom;
+        t->child->sibling = atom;
         *atomp = t;
         /* rest of branch can be strung starting from atom->end */
         s2 = atom->end;
@@ -1222,44 +1213,43 @@ parseqatom(struct vars *v,
         NOERR();
         t->min = (short) m;
         t->max = (short) n;
-        t->left = atom;
+        t->child = atom;
         *atomp = t;
         /* rest of branch is to be strung from iteration's end state */
     }

     /* and finally, look after that postponed recursion */
-    t = top->right;
+    t = top->child->sibling;
     if (!(SEE('|') || SEE(stopper) || SEE(EOS)))
     {
-        /* parse all the rest of the branch, and insert in t->right */
-        t->right = parsebranch(v, stopper, type, s2, rp, 1);
+        /* parse all the rest of the branch, and insert in t->child->sibling */
+        t->child->sibling = parsebranch(v, stopper, type, s2, rp, 1);
         NOERR();
         assert(SEE('|') || SEE(stopper) || SEE(EOS));

         /* here's the promised update of the flags */
-        t->flags |= COMBINE(t->flags, t->right->flags);
+        t->flags |= COMBINE(t->flags, t->child->sibling->flags);
         top->flags |= COMBINE(top->flags, t->flags);

         /*
          * At this point both top and t are concatenation (op == '.') subres,
-         * and we have top->left = prefix of branch, top->right = t, t->left =
-         * messy atom (with quantification superstructure if needed), t->right
-         * = rest of branch.
+         * and we have top->child = prefix of branch, top->child->sibling = t,
+         * t->child = messy atom (with quantification superstructure if
+         * needed), t->child->sibling = rest of branch.
          *
-         * If the messy atom was the first thing in the branch, then top->left
-         * is vacuous and we can get rid of one level of concatenation.  Since
-         * the caller is holding a pointer to the top node, we can't remove
-         * that node; but we're allowed to change its properties.
+         * If the messy atom was the first thing in the branch, then
+         * top->child is vacuous and we can get rid of one level of
+         * concatenation.  Since the caller is holding a pointer to the top
+         * node, we can't remove that node; but we're allowed to change its
+         * properties.
          */
-        assert(top->left->op == '=');
-        if (top->left->begin == top->left->end)
+        assert(top->child->op == '=');
+        if (top->child->begin == top->child->end)
         {
-            assert(!MESSY(top->left->flags));
-            freesubre(v, top->left);
-            top->left = t->left;
-            top->right = t->right;
-            t->left = t->right = NULL;
-            freesubre(v, t);
+            assert(!MESSY(top->child->flags));
+            freesubre(v, top->child);
+            top->child = t->child;
+            freesrnode(v, t);
         }
     }
     else
@@ -1269,34 +1259,31 @@ parseqatom(struct vars *v,
          * concatenation node 't'.  Just link s2 straight to rp.
          */
         EMPTYARC(s2, rp);
-        top->right = t->left;
-        top->flags |= COMBINE(top->flags, top->right->flags);
-        t->left = t->right = NULL;
-        freesubre(v, t);
+        top->child->sibling = t->child;
+        top->flags |= COMBINE(top->flags, top->child->sibling->flags);
+        freesrnode(v, t);

         /*
-         * Again, it could be that top->left is vacuous (if the messy atom was
-         * in fact the only thing in the branch).  In that case we need no
-         * concatenation at all; just replace top with top->right.
+         * Again, it could be that top->child is vacuous (if the messy atom
+         * was in fact the only thing in the branch).  In that case we need no
+         * concatenation at all; just replace top with top->child->sibling.
          */
-        assert(top->left->op == '=');
-        if (top->left->begin == top->left->end)
+        assert(top->child->op == '=');
+        if (top->child->begin == top->child->end)
         {
-            assert(!MESSY(top->left->flags));
-            freesubre(v, top->left);
-            t = top->right;
+            assert(!MESSY(top->child->flags));
+            t = top->child->sibling;
+            freesubre(v, top->child);
             top->op = t->op;
             top->flags = t->flags;
             top->id = t->id;
             top->subno = t->subno;
             top->min = t->min;
             top->max = t->max;
-            top->left = t->left;
-            top->right = t->right;
+            top->child = t->child;
             top->begin = t->begin;
             top->end = t->end;
-            t->left = t->right = NULL;
-            freesubre(v, t);
+            freesrnode(v, t);
         }
     }
 }
@@ -1786,7 +1773,7 @@ subre(struct vars *v,
     }

     if (ret != NULL)
-        v->treefree = ret->left;
+        v->treefree = ret->child;
     else
     {
         ret = (struct subre *) MALLOC(sizeof(struct subre));
@@ -1806,8 +1793,8 @@ subre(struct vars *v,
     ret->id = 0;                /* will be assigned later */
     ret->subno = 0;
     ret->min = ret->max = 1;
-    ret->left = NULL;
-    ret->right = NULL;
+    ret->child = NULL;
+    ret->sibling = NULL;
     ret->begin = begin;
     ret->end = end;
     ZAPCNFA(ret->cnfa);
@@ -1817,6 +1804,9 @@ subre(struct vars *v,

 /*
  * freesubre - free a subRE subtree
+ *
+ * This frees child node(s) of the given subRE too,
+ * but not its siblings.
  */
 static void
 freesubre(struct vars *v,        /* might be NULL */
@@ -1825,14 +1815,31 @@ freesubre(struct vars *v,        /* might be NULL */
     if (sr == NULL)
         return;

-    if (sr->left != NULL)
-        freesubre(v, sr->left);
-    if (sr->right != NULL)
-        freesubre(v, sr->right);
+    if (sr->child != NULL)
+        freesubreandsiblings(v, sr->child);

     freesrnode(v, sr);
 }

+/*
+ * freesubreandsiblings - free a subRE subtree
+ *
+ * This frees child node(s) of the given subRE too,
+ * as well as any following siblings.
+ */
+static void
+freesubreandsiblings(struct vars *v,    /* might be NULL */
+                     struct subre *sr)
+{
+    while (sr != NULL)
+    {
+        struct subre *next = sr->sibling;
+
+        freesubre(v, sr);
+        sr = next;
+    }
+}
+
 /*
  * freesrnode - free one node in a subRE subtree
  */
@@ -1850,7 +1857,7 @@ freesrnode(struct vars *v,        /* might be NULL */
     if (v != NULL && v->treechain != NULL)
     {
         /* we're still parsing, maybe we can reuse the subre */
-        sr->left = v->treefree;
+        sr->child = v->treefree;
         v->treefree = sr;
     }
     else
@@ -1881,15 +1888,14 @@ numst(struct subre *t,
       int start)                /* starting point for subtree numbers */
 {
     int            i;
+    struct subre *t2;

     assert(t != NULL);

     i = start;
     t->id = (short) i++;
-    if (t->left != NULL)
-        i = numst(t->left, i);
-    if (t->right != NULL)
-        i = numst(t->right, i);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        i = numst(t2, i);
     return i;
 }

@@ -1913,13 +1919,13 @@ numst(struct subre *t,
 static void
 markst(struct subre *t)
 {
+    struct subre *t2;
+
     assert(t != NULL);

     t->flags |= INUSE;
-    if (t->left != NULL)
-        markst(t->left);
-    if (t->right != NULL)
-        markst(t->right);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        markst(t2);
 }

 /*
@@ -1949,12 +1955,12 @@ nfatree(struct vars *v,
         struct subre *t,
         FILE *f)                /* for debug output */
 {
+    struct subre *t2;
+
     assert(t != NULL && t->begin != NULL);

-    if (t->left != NULL)
-        (DISCARD) nfatree(v, t->left, f);
-    if (t->right != NULL)
-        (DISCARD) nfatree(v, t->right, f);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        (DISCARD) nfatree(v, t2, f);

     return nfanode(v, t, 0, f);
 }
@@ -2206,6 +2212,7 @@ stdump(struct subre *t,
        int nfapresent)            /* is the original NFA still around? */
 {
     char        idbuf[50];
+    struct subre *t2;

     fprintf(f, "%s. `%c'", stid(t, idbuf, sizeof(idbuf)), t->op);
     if (t->flags & LONGER)
@@ -2231,20 +2238,18 @@ stdump(struct subre *t,
     }
     if (nfapresent)
         fprintf(f, " %ld-%ld", (long) t->begin->no, (long) t->end->no);
-    if (t->left != NULL)
-        fprintf(f, " L:%s", stid(t->left, idbuf, sizeof(idbuf)));
-    if (t->right != NULL)
-        fprintf(f, " R:%s", stid(t->right, idbuf, sizeof(idbuf)));
+    if (t->child != NULL)
+        fprintf(f, " C:%s", stid(t->child, idbuf, sizeof(idbuf)));
+    if (t->sibling != NULL)
+        fprintf(f, " S:%s", stid(t->sibling, idbuf, sizeof(idbuf)));
     if (!NULLCNFA(t->cnfa))
     {
         fprintf(f, "\n");
         dumpcnfa(&t->cnfa, f);
     }
     fprintf(f, "\n");
-    if (t->left != NULL)
-        stdump(t->left, f, nfapresent);
-    if (t->right != NULL)
-        stdump(t->right, f, nfapresent);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        stdump(t2, f, nfapresent);
 }

 /*
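
Reader's note on the regcomp.c changes above: the patch replaces the binary left/right links in struct subre with a first-child/next-sibling representation, so a node can carry any number of children and every tree walk becomes a loop over the sibling chain. The sketch below illustrates that representation in isolation. It is an assumption-laden model, not PostgreSQL code: struct node and the functions node_new, node_append, node_number and node_free are hypothetical names that loosely mirror subre(), numst() and the freesubre()/freesubreandsiblings() pair.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature of struct subre: only the tree links matter here. */
struct node
{
    char         op;        /* '|', '.', '=', ... as in the patch */
    int          id;        /* assigned by node_number() */
    struct node *child;     /* first child, or NULL */
    struct node *sibling;   /* next sibling under the same parent, or NULL */
};

void node_free(struct node *t);

/* Allocate a childless, siblingless node; returns NULL on out-of-memory. */
struct node *
node_new(char op)
{
    struct node *n = calloc(1, sizeof(struct node));

    if (n != NULL)
        n->op = op;
    return n;
}

/* Append kid at the end of parent's child list. */
void
node_append(struct node *parent, struct node *kid)
{
    struct node **link = &parent->child;

    while (*link != NULL)
        link = &(*link)->sibling;
    *link = kid;
}

/* Preorder numbering, in the style of numst(): returns the next free id. */
int
node_number(struct node *t, int start)
{
    int          i = start;
    struct node *t2;

    assert(t != NULL);
    t->id = i++;
    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
        i = node_number(t2, i);
    return i;
}

/* Free a chain of siblings, fetching each next link before freeing. */
static void
node_free_with_siblings(struct node *t)
{
    while (t != NULL)
    {
        struct node *next = t->sibling;

        node_free(t);
        t = next;
    }
}

/* Free a node and everything below it, but not its own siblings. */
void
node_free(struct node *t)
{
    if (t == NULL)
        return;
    node_free_with_siblings(t->child);
    free(t);
}
```

Building an alternation node with three leaf children and calling node_number(root, 1) assigns ids 1 through 4 in preorder. The point of the layout is that an N-way alternation needs one parent plus N children, where the old scheme needed a chain of N binary '|' nodes.
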
diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c
index f7eaa76b02..4b65e38329 100644
--- a/src/backend/regex/regexec.c
+++ b/src/backend/regex/regexec.c
@@ -635,6 +635,8 @@ static void
 zaptreesubs(struct vars *v,
             struct subre *t)
 {
+    struct subre *t2;
+
     if (t->op == '(')
     {
         int            n = t->subno;
@@ -647,10 +649,8 @@ zaptreesubs(struct vars *v,
         }
     }

-    if (t->left != NULL)
-        zaptreesubs(v, t->left);
-    if (t->right != NULL)
-        zaptreesubs(v, t->right);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        zaptreesubs(v, t2);
 }

 /*
@@ -709,35 +709,35 @@ cdissect(struct vars *v,
     switch (t->op)
     {
         case '=':                /* terminal node */
-            assert(t->left == NULL && t->right == NULL);
+            assert(t->child == NULL);
             er = REG_OKAY;        /* no action, parent did the work */
             break;
         case 'b':                /* back reference */
-            assert(t->left == NULL && t->right == NULL);
+            assert(t->child == NULL);
             er = cbrdissect(v, t, begin, end);
             break;
         case '.':                /* concatenation */
-            assert(t->left != NULL && t->right != NULL);
-            if (t->left->flags & SHORTER)    /* reverse scan */
+            assert(t->child != NULL);
+            if (t->child->flags & SHORTER)    /* reverse scan */
                 er = crevcondissect(v, t, begin, end);
             else
                 er = ccondissect(v, t, begin, end);
             break;
         case '|':                /* alternation */
-            assert(t->left != NULL);
+            assert(t->child != NULL);
             er = caltdissect(v, t, begin, end);
             break;
         case '*':                /* iteration */
-            assert(t->left != NULL);
-            if (t->left->flags & SHORTER)    /* reverse scan */
+            assert(t->child != NULL);
+            if (t->child->flags & SHORTER)    /* reverse scan */
                 er = creviterdissect(v, t, begin, end);
             else
                 er = citerdissect(v, t, begin, end);
             break;
         case '(':                /* capturing */
-            assert(t->left != NULL && t->right == NULL);
+            assert(t->child != NULL);
             assert(t->subno > 0);
-            er = cdissect(v, t->left, begin, end);
+            er = cdissect(v, t->child, begin, end);
             if (er == REG_OKAY)
                 subset(v, t, begin, end);
             break;
@@ -765,19 +765,22 @@ ccondissect(struct vars *v,
             chr *begin,            /* beginning of relevant substring */
             chr *end)            /* end of same */
 {
+    struct subre *left = t->child;
+    struct subre *right = left->sibling;
     struct dfa *d;
     struct dfa *d2;
     chr           *mid;
     int            er;

     assert(t->op == '.');
-    assert(t->left != NULL && t->left->cnfa.nstates > 0);
-    assert(t->right != NULL && t->right->cnfa.nstates > 0);
-    assert(!(t->left->flags & SHORTER));
+    assert(left != NULL && left->cnfa.nstates > 0);
+    assert(right != NULL && right->cnfa.nstates > 0);
+    assert(right->sibling == NULL);
+    assert(!(left->flags & SHORTER));

-    d = getsubdfa(v, t->left);
+    d = getsubdfa(v, left);
     NOERR();
-    d2 = getsubdfa(v, t->right);
+    d2 = getsubdfa(v, right);
     NOERR();
     MDEBUG(("cconcat %d\n", t->id));

@@ -794,10 +797,10 @@ ccondissect(struct vars *v,
         /* try this midpoint on for size */
         if (longest(v, d2, mid, end, (int *) NULL) == end)
         {
-            er = cdissect(v, t->left, begin, mid);
+            er = cdissect(v, left, begin, mid);
             if (er == REG_OKAY)
             {
-                er = cdissect(v, t->right, mid, end);
+                er = cdissect(v, right, mid, end);
                 if (er == REG_OKAY)
                 {
                     /* satisfaction */
@@ -826,8 +829,8 @@ ccondissect(struct vars *v,
             return REG_NOMATCH;
         }
         MDEBUG(("%d: new midpoint %ld\n", t->id, LOFF(mid)));
-        zaptreesubs(v, t->left);
-        zaptreesubs(v, t->right);
+        zaptreesubs(v, left);
+        zaptreesubs(v, right);
     }

     /* can't get here */
@@ -843,19 +846,22 @@ crevcondissect(struct vars *v,
                chr *begin,        /* beginning of relevant substring */
                chr *end)        /* end of same */
 {
+    struct subre *left = t->child;
+    struct subre *right = left->sibling;
     struct dfa *d;
     struct dfa *d2;
     chr           *mid;
     int            er;

     assert(t->op == '.');
-    assert(t->left != NULL && t->left->cnfa.nstates > 0);
-    assert(t->right != NULL && t->right->cnfa.nstates > 0);
-    assert(t->left->flags & SHORTER);
+    assert(left != NULL && left->cnfa.nstates > 0);
+    assert(right != NULL && right->cnfa.nstates > 0);
+    assert(right->sibling == NULL);
+    assert(left->flags & SHORTER);

-    d = getsubdfa(v, t->left);
+    d = getsubdfa(v, left);
     NOERR();
-    d2 = getsubdfa(v, t->right);
+    d2 = getsubdfa(v, right);
     NOERR();
     MDEBUG(("crevcon %d\n", t->id));

@@ -872,10 +878,10 @@ crevcondissect(struct vars *v,
         /* try this midpoint on for size */
         if (longest(v, d2, mid, end, (int *) NULL) == end)
         {
-            er = cdissect(v, t->left, begin, mid);
+            er = cdissect(v, left, begin, mid);
             if (er == REG_OKAY)
             {
-                er = cdissect(v, t->right, mid, end);
+                er = cdissect(v, right, mid, end);
                 if (er == REG_OKAY)
                 {
                     /* satisfaction */
@@ -904,8 +910,8 @@ crevcondissect(struct vars *v,
             return REG_NOMATCH;
         }
         MDEBUG(("%d: new midpoint %ld\n", t->id, LOFF(mid)));
-        zaptreesubs(v, t->left);
-        zaptreesubs(v, t->right);
+        zaptreesubs(v, left);
+        zaptreesubs(v, right);
     }

     /* can't get here */
@@ -1005,26 +1011,30 @@ caltdissect(struct vars *v,
     struct dfa *d;
     int            er;

-    /* We loop, rather than tail-recurse, to handle a chain of alternatives */
+    assert(t->op == '|');
+
+    t = t->child;
+    /* there should be at least 2 alternatives */
+    assert(t != NULL && t->sibling != NULL);
+
     while (t != NULL)
     {
-        assert(t->op == '|');
-        assert(t->left != NULL && t->left->cnfa.nstates > 0);
+        assert(t->cnfa.nstates > 0);

         MDEBUG(("calt n%d\n", t->id));

-        d = getsubdfa(v, t->left);
+        d = getsubdfa(v, t);
         NOERR();
         if (longest(v, d, begin, end, (int *) NULL) == end)
         {
             MDEBUG(("calt matched\n"));
-            er = cdissect(v, t->left, begin, end);
+            er = cdissect(v, t, begin, end);
             if (er != REG_NOMATCH)
                 return er;
         }
         NOERR();

-        t = t->right;
+        t = t->sibling;
     }

     return REG_NOMATCH;
@@ -1050,8 +1060,8 @@ citerdissect(struct vars *v,
     int            er;

     assert(t->op == '*');
-    assert(t->left != NULL && t->left->cnfa.nstates > 0);
-    assert(!(t->left->flags & SHORTER));
+    assert(t->child != NULL && t->child->cnfa.nstates > 0);
+    assert(!(t->child->flags & SHORTER));
     assert(begin <= end);

     /*
@@ -1086,7 +1096,7 @@ citerdissect(struct vars *v,
         return REG_ESPACE;
     endpts[0] = begin;

-    d = getsubdfa(v, t->left);
+    d = getsubdfa(v, t->child);
     if (ISERR())
     {
         FREE(endpts);
@@ -1165,8 +1175,8 @@ citerdissect(struct vars *v,

         for (i = nverified + 1; i <= k; i++)
         {
-            zaptreesubs(v, t->left);
-            er = cdissect(v, t->left, endpts[i - 1], endpts[i]);
+            zaptreesubs(v, t->child);
+            er = cdissect(v, t->child, endpts[i - 1], endpts[i]);
             if (er == REG_OKAY)
             {
                 nverified = i;
@@ -1251,8 +1261,8 @@ creviterdissect(struct vars *v,
     int            er;

     assert(t->op == '*');
-    assert(t->left != NULL && t->left->cnfa.nstates > 0);
-    assert(t->left->flags & SHORTER);
+    assert(t->child != NULL && t->child->cnfa.nstates > 0);
+    assert(t->child->flags & SHORTER);
     assert(begin <= end);

     /*
@@ -1287,7 +1297,7 @@ creviterdissect(struct vars *v,
         return REG_ESPACE;
     endpts[0] = begin;

-    d = getsubdfa(v, t->left);
+    d = getsubdfa(v, t->child);
     if (ISERR())
     {
         FREE(endpts);
@@ -1372,8 +1382,8 @@ creviterdissect(struct vars *v,

         for (i = nverified + 1; i <= k; i++)
         {
-            zaptreesubs(v, t->left);
-            er = cdissect(v, t->left, endpts[i - 1], endpts[i]);
+            zaptreesubs(v, t->child);
+            er = cdissect(v, t->child, endpts[i - 1], endpts[i]);
             if (er == REG_OKAY)
             {
                 nverified = i;
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 82e761bfe5..249c44982f 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -423,15 +423,17 @@ struct cnfa
  *        '='  plain regex without interesting substructure (implemented as DFA)
  *        'b'  back-reference (has no substructure either)
  *        '('  capture node: captures the match of its single child
- *        '.'  concatenation: matches a match for left, then a match for right
- *        '|'  alternation: matches a match for left or a match for right
+ *        '.'  concatenation: matches a match for first child, then second child
+ *        '|'  alternation: matches a match for any of its children
  *        '*'  iteration: matches some number of matches of its single child
  *
- * Note: the right child of an alternation must be another alternation or
- * NULL; hence, an N-way branch requires N alternation nodes, not N-1 as you
- * might expect.  This could stand to be changed.  Actually I'd rather see
- * a single alternation node with N children, but that will take revising
- * the representation of struct subre.
+ * An alternation node can have any number of children (but at least two),
+ * linked through their sibling fields.
+ *
+ * A concatenation node must have exactly two children.  It might be useful
+ * to support more, but that would complicate the executor.  Note that it is
+ * the first child's greediness that determines the node's preference for
+ * where to split a match.
  *
  * Note: when a backref is directly quantified, we stick the min/max counts
  * into the backref rather than plastering an iteration node on top.  This is
@@ -460,8 +462,8 @@ struct subre
                                  * LATYPE code for lookaround constraint */
     short        min;            /* min repetitions for iteration or backref */
     short        max;            /* max repetitions for iteration or backref */
-    struct subre *left;            /* left child, if any (also freelist chain) */
-    struct subre *right;        /* right child, if any */
+    struct subre *child;        /* first child, if any (also freelist chain) */
+    struct subre *sibling;        /* next child of same parent, if any */
     struct state *begin;        /* outarcs from here... */
     struct state *end;            /* ...ending in inarcs here */
     struct cnfa cnfa;            /* compacted NFA, if any */
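
As a rough illustration of the child/sibling linkage introduced in the regguts.h hunk above, here is a minimal sketch of walking an N-way alternation the way the revised caltdissect() loop does. The struct node type and both functions are hypothetical stand-ins for illustration; they are not the real struct subre or any function in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down stand-in for struct subre: only the linkage fields. */
struct node
{
    char         op;        /* '|', '.', '=', ... as listed in regguts.h */
    struct node *child;     /* first child, if any */
    struct node *sibling;   /* next child of same parent, if any */
};

/* Count the children of a parent by following the sibling chain. */
int
count_children(const struct node *parent)
{
    int          n = 0;
    const struct node *t;

    for (t = parent->child; t != NULL; t = t->sibling)
        n++;
    return n;
}

/*
 * Build the tree for a three-way alternation such as "a|b|c": a single '|'
 * node with three children, rather than a chain of three '|' nodes.
 */
int
demo_three_way_alternation(void)
{
    struct node c = {'=', NULL, NULL};
    struct node b = {'=', NULL, &c};
    struct node a = {'=', NULL, &b};
    struct node alt = {'|', &a, NULL};

    return count_children(&alt);
}
```

Under the old left/right scheme the same regex needed three alternation nodes; here the parent's child pointer plus two sibling links cover all three branches.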

Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

15 February 2021, 08:21:21

On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:

> I got these runtimes (non-cassert builds):
>
> HEAD   313661.149 ms (05:13.661)
> +0001  297397.293 ms (04:57.397)   5% better than HEAD
> +0002  151995.803 ms (02:31.996)  51% better than HEAD
> +0003  139843.934 ms (02:19.844)  55% better than HEAD
> +0004   95034.611 ms (01:35.035)  69% better than HEAD
>
> Since I don't have all the tables used in your query, I can't
> try to reproduce your results exactly. I suspect the reason
> I'm getting a better percentage improvement than you did is
> that the joining/grouping/ordering involved in your query
> creates a higher baseline query cost.

Mind-blowing speed-up, wow!

I've tested all 4 patches successfully.

To eliminate the baseline cost of the join, I first created this table:

CREATE TABLE performance_test AS
SELECT
    subjects.subject,
    patterns.pattern,
    tests.is_match,
    tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
;

Then I ran this query:

\timing

SELECT
    is_match <> (subject ~ pattern),
    captured IS DISTINCT FROM regexp_match(subject, pattern),
    COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;

All patches gave the same result:

 ?column? | ?column? |  count
----------+----------+---------
 f        | f        | 1448212
(1 row)

I.e., no detected semantic differences.

Timing differences:

HEAD   570632.722 ms (09:30.633)
+0001  472938.857 ms (07:52.939)  17% better than HEAD
+0002  451638.049 ms (07:31.638)  20% better than HEAD
+0003  439377.813 ms (07:19.378)  23% better than HEAD
+0004   96447.038 ms (01:36.447)  83% better than HEAD
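
The percentages in both timing tables appear to be (HEAD - patched) / HEAD, truncated to a whole percent (for instance, 20.85% is shown as 20%). That is an inference from the figures, not something stated in the thread. A small sketch that reproduces them, where percent_better() is a hypothetical helper:

```c
#include <assert.h>

/*
 * Percentage by which "patched_ms" improves on "head_ms", truncated toward
 * zero to a whole percent.  Hypothetical helper, for illustration only.
 */
int
percent_better(double head_ms, double patched_ms)
{
    return (int) ((head_ms - patched_ms) / head_ms * 100.0);
}
```

For example, percent_better(570632.722, 96447.038) gives 83, matching the +0004 row above.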

I tested on my MacBook Pro 2.4GHz 8-Core Intel Core i9, 32 GB 2400 MHz DDR4 running macOS Big Sur 11.1:

SELECT version();

                                                       version
----------------------------------------------------------------------------------------------------------------------
 PostgreSQL 14devel on x86_64-apple-darwin20.2.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
(1 row)

My HEAD = 46d6e5f567906389c31c4fb3a2653da1885c18ee.

PostgreSQL was compiled with just ./configure, no parameters, and the only non-default postgresql.conf settings were these:

log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql.log'

Amazing work!

I hope to have a new dataset ready soon with regex flags for applied subjects as well.

/Joel

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

17 February 2021, 21:00:36

"Joel Jacobson" <joel@compiler.org> writes:
> I've tested all 4 patches successfully.

Thanks!

I found one other area that could be improved with the same idea of
getting rid of unnecessary subre's: right now, every pair of capturing
parentheses gives rise to a "capture" subre with an NFA identical to
its single child subre (which is what does the actual matching work).
While this doesn't really add any runtime cost, the duplicate NFA
definitely does add to the compilation cost, since we run it through
optimization independently of the child.

I initially thought that we could just flush capture subres altogether
in favor of annotating their children with a "we need to capture this
result" marker.  However, Spencer's regression tests soon exposed the
flaw in that notion.  It's legal to write "((x))" or even "((((x))))",
so that there can be multiple sets of capturing parentheses with a
single child node.  The solution adopted in the attached 0005 is to
handle the innermost capture with a marker on the child subre, but if
we need an additional capture on a node that's already marked, put
a capture subre on top just like before.  One could instead complicate
the data structure by allowing N capture markers on a single subre
node, but I judged that not to be a good tradeoff.  I don't see any
reason except spec compliance to allow such equivalent capture groups,
so I don't care if they're a bit inefficient.  (If anyone knows of a
useful application for writing REs like this, we could reconsider that
choice.)
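
A minimal sketch of the rule described above: mark the innermost capture on the child itself, and fall back to a separate capture node only when the child is already marked. The struct node type, its field names, and add_capture() are hypothetical simplifications, not the code in 0005.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified subre: just enough to show the idea. */
struct node
{
    char         op;      /* '(' for a separate capture node, '=' otherwise */
    int          capno;   /* capture group number, or 0 if not marked */
    struct node *child;   /* single child of a '(' node, else NULL */
};

/*
 * Attach capture group "capno" to the subexpression "sub".
 *
 * If "sub" is not yet marked as capturing, mark it in place.  Otherwise
 * wrap it in a separate capture node.  Returns the node that now stands
 * for the parenthesized subexpression, or NULL on out-of-memory.
 */
struct node *
add_capture(struct node *sub, int capno)
{
    struct node *wrapper;

    if (sub->capno == 0)
    {
        sub->capno = capno;
        return sub;
    }

    wrapper = malloc(sizeof(struct node));
    if (wrapper == NULL)
        return NULL;
    wrapper->op = '(';
    wrapper->capno = capno;
    wrapper->child = sub;
    return wrapper;
}

/* "((x))": the inner parens are group 2, the outer parens are group 1. */
int
demo_double_parens(void)
{
    struct node x = {'=', 0, NULL};
    struct node *inner = add_capture(&x, 2);    /* marks x itself */
    struct node *outer = add_capture(inner, 1); /* needs a wrapper node */
    int          ok;

    ok = (inner == &x && x.capno == 2 &&
          outer != NULL && outer != &x &&
          outer->op == '(' && outer->capno == 1 && outer->child == &x);
    if (outer != NULL && outer != &x)
        free(outer);
    return ok;
}
```

In the common case of a single pair of parentheses, no extra node (and hence no duplicate NFA) is created; only the redundant outer pair pays for a wrapper.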

One small issue with marking the child directly is that we can't get
away any longer with overlaying capture and backref subexpression
numbers, since you could theoretically write (\1) which'd result in
needing to put a capture label on a backref subre.  This could again
have been handled by making the capture a separate node, but I really
don't much care for the way that subre.subno has been overloaded for
three(!) different purposes depending on node type.  So I just broke
it into three separate fields.  Maybe the incremental cost of the
larger subre struct was worth worrying about back in 1997 ... but
I kind of doubt that it was a useful micro-optimization even then,
considering the additional NFA baggage that every subre carries.
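
To make the overloading problem concrete, here is a sketch contrasting a single shared number with one field per purpose. The struct names, the field names capno, backno and latype, and the helper are assumptions for illustration; the thread only says the field was split into three.

```c
#include <assert.h>

/* Old style, simplified: one field whose meaning depends on the node type. */
struct old_node
{
    char op;      /* '(' capture, 'b' backref, or a lookaround constraint */
    int  subno;   /* capture number, backref number, or LATYPE code */
};

/* New style: hypothetical field names, one per purpose. */
struct new_node
{
    char op;
    char latype;  /* lookaround type, if a lookaround constraint */
    int  capno;   /* group to capture into, or 0 if none */
    int  backno;  /* group referred to, if a backref */
};

/*
 * Represent the second group of a regex like "(a)(\1)": a backref to
 * group "backno" whose own match is captured as group "capno".  With a
 * single subno field this node would have to hold two numbers at once.
 */
struct new_node
make_captured_backref(int capno, int backno)
{
    struct new_node n = {'b', 0, capno, backno};

    return n;
}
```

With separate fields, make_captured_backref(2, 1) carries both the capture label and the back-reference target on the same node.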

Also, I widened "subre.id" from short to int, since the narrower field
no longer saves anything given the new struct layout.  The existing
choice was dubious already, because every other use of subre ID
numbers was int or even size_t, and there was nothing checking for
overflow of the id fields.  (Although perhaps it doesn't matter,
since I'm unsure that the id fields are really used for anything
except debugging purposes.)
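
One way to see why the narrower id no longer saves anything is to compare struct sizes directly. The layouts below are hypothetical reconstructions, not copied from the patch, and the result assumes a common ABI in which int is 4 bytes with 4-byte alignment: the short id is followed by alignment padding anyway, so widening it to int just fills that padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout with the old 16-bit id ... */
struct subre_short_id
{
    char  op;
    char  flags;
    char  latype;
    short id;
    int   capno;
    int   backno;
    short min;
    short max;
    void *child;
    void *sibling;
};

/* ... and the same layout with id widened to int. */
struct subre_int_id
{
    char  op;
    char  flags;
    char  latype;
    int   id;
    int   capno;
    int   backno;
    short min;
    short max;
    void *child;
    void *sibling;
};

/* Returns 1 if widening id changes neither the size nor capno's offset. */
int
widening_is_free(void)
{
    return sizeof(struct subre_int_id) == sizeof(struct subre_short_id) &&
        offsetof(struct subre_int_id, capno) ==
        offsetof(struct subre_short_id, capno);
}
```

On such ABIs widening_is_free() returns 1: in both layouts id starts at offset 4 and capno at offset 8.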

For me, 0005 makes a fairly perceptible difference on your test case
subject_id = 611875, which I've been paying attention to because it's
the one that failed with "regular expression is too complex" before.
I see about a 20% time savings relative to 0004 on that case, but not really
any noticeable difference in the total runtime for the whole suite.
So I think we're getting to the point of diminishing returns for
this concept (another reason for not chasing after optimization of
the duplicate-captures case).  Still, we're clearly way ahead of
where we started.

Attached is an updated patch series; it's rebased over 4e703d671
which took care of some not-really-related fixes, and I made a
pass of cleanup and comment improvements.  I think this is pretty
much ready to commit, unless you want to do more testing or
code-reading.

            regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 1e4f0121f3..fcf03de32d 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -282,8 +282,8 @@ typedef struct
 typedef int TrgmColor;

 /* We assume that colors returned by the regexp engine cannot be these: */
-#define COLOR_UNKNOWN    (-1)
-#define COLOR_BLANK        (-2)
+#define COLOR_UNKNOWN    (-3)
+#define COLOR_BLANK        (-4)

 typedef struct
 {
@@ -780,7 +780,8 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
         palloc0(colorsCount * sizeof(TrgmColorInfo));

     /*
-     * Loop over colors, filling TrgmColorInfo about each.
+     * Loop over colors, filling TrgmColorInfo about each.  Note we include
+     * WHITE (0) even though we know it'll be reported as non-expandable.
      */
     for (i = 0; i < colorsCount; i++)
     {
@@ -1098,9 +1099,9 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
             /* Add enter key to this state */
             addKeyToQueue(trgmNFA, &destKey);
         }
-        else
+        else if (arc->co >= 0)
         {
-            /* Regular color */
+            /* Regular color (including WHITE) */
             TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[arc->co];

             if (colorInfo->expandable)
@@ -1156,6 +1157,14 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
                 addKeyToQueue(trgmNFA, &destKey);
             }
         }
+        else
+        {
+            /* RAINBOW: treat as unexpandable color */
+            destKey.prefix.colors[0] = COLOR_UNKNOWN;
+            destKey.prefix.colors[1] = COLOR_UNKNOWN;
+            destKey.nstate = arc->to;
+            addKeyToQueue(trgmNFA, &destKey);
+        }
     }

     pfree(arcs);
@@ -1216,10 +1225,10 @@ addArcs(TrgmNFA *trgmNFA, TrgmState *state)
             /*
              * Ignore non-expandable colors; addKey already handled the case.
              *
-             * We need no special check for begin/end pseudocolors here.  We
-             * don't need to do any processing for them, and they will be
-             * marked non-expandable since the regex engine will have reported
-             * them that way.
+             * We need no special check for WHITE or begin/end pseudocolors
+             * here.  We don't need to do any processing for them, and they
+             * will be marked non-expandable since the regex engine will have
+             * reported them that way.
              */
             if (!colorInfo->expandable)
                 continue;
diff --git a/src/backend/regex/README b/src/backend/regex/README
index f08aab69e3..a83ab5074d 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -261,6 +261,18 @@ and the NFA has these arcs:
     states 4 -> 5 on color 2 ("x" only)
 which can be seen to be a correct representation of the regex.

+There is one more complexity, which is how to handle ".", that is a
+match-anything atom.  We used to do that by generating a "rainbow"
+of arcs of all live colors between the two NFA states before and after
+the dot.  That's expensive in itself when there are lots of colors,
+and it also typically adds lots of follow-on arc-splitting work for the
+color splitting logic.  Now we handle this case by generating a single arc
+labeled with the special color RAINBOW, meaning all colors.  Such arcs
+never need to be split, so they help keep NFAs small in this common case.
+(Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
+not supposed to match newline.  In that case we still handle "." by
+generating an almost-rainbow of all colors except newline's color.)
+
 Given this summary, we can see we need the following operations for
 colors:

@@ -349,6 +361,8 @@ The possible arc types are:

     PLAIN arcs, which specify matching of any character of a given "color"
     (see above).  These are dumped as "[color_number]->to_state".
+    In addition there can be "rainbow" PLAIN arcs, which are dumped as
+    "[*]->to_state".

     EMPTY arcs, which specify a no-op transition to another state.  These
     are dumped as "->to_state".
@@ -356,11 +370,11 @@ The possible arc types are:
     AHEAD constraints, which represent a "next character must be of this
     color" constraint.  AHEAD differs from a PLAIN arc in that the input
     character is not consumed when crossing the arc.  These are dumped as
-    ">color_number>->to_state".
+    ">color_number>->to_state", or possibly ">*>->to_state".

     BEHIND constraints, which represent a "previous character must be of
     this color" constraint, which likewise consumes no input.  These are
-    dumped as "<color_number<->to_state".
+    dumped as "<color_number<->to_state", or possibly "<*<->to_state".

     '^' arcs, which specify a beginning-of-input constraint.  These are
     dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
@@ -396,14 +410,20 @@ substring, or an imaginary following EOS character if the substring is at
 the end of the input.
 3. If the NFA is (or can be) in the goal state at this point, it matches.

+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any.  The executor implements that by checking if the
+adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
+right color, and it does that in the same loop that checks characters
+within the match.
+
 So one can mentally execute an untransformed NFA by taking ^ and $ as
 ordinary constraints that match at start and end of input; but plain
 arcs out of the start state should be taken as matches for the character
 before the target substring, and similarly, plain arcs leading to the
 post state are matches for the character after the target substring.
-This definition is necessary to support regexes that begin or end with
-constraints such as \m and \M, which imply requirements on the adjacent
-character if any.  NFAs for simple unanchored patterns will usually have
-pre-state outarcs for all possible character colors as well as BOS and
-BOL, and post-state inarcs for all possible character colors as well as
-EOS and EOL, so that the executor's behavior will work.
+After the optimize() transformation, there are explicit arcs mentioning
+BOS/BOL/EOS/EOL adjacent to the pre-state and post-state.  So a finished
+NFA for a pattern without anchors or adjacent-character constraints will
+have pre-state outarcs for RAINBOW (all possible character colors) as well
+as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.
diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index f5a4151757..0864011cce 100644
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -977,6 +977,7 @@ colorchain(struct colormap *cm,
 {
     struct colordesc *cd = &cm->cd[a->co];

+    assert(a->co >= 0);
     if (cd->arcs != NULL)
         cd->arcs->colorchainRev = a;
     a->colorchain = cd->arcs;
@@ -994,6 +995,7 @@ uncolorchain(struct colormap *cm,
     struct colordesc *cd = &cm->cd[a->co];
     struct arc *aa = a->colorchainRev;

+    assert(a->co >= 0);
     if (aa == NULL)
     {
         assert(cd->arcs == a);
@@ -1012,6 +1014,9 @@ uncolorchain(struct colormap *cm,

 /*
  * rainbow - add arcs of all full colors (but one) between specified states
+ *
+ * If there isn't an exception color, we now generate just a single arc
+ * labeled RAINBOW, saving lots of arc-munging later on.
  */
 static void
 rainbow(struct nfa *nfa,
@@ -1025,6 +1030,13 @@ rainbow(struct nfa *nfa,
     struct colordesc *end = CDEND(cm);
     color        co;

+    if (but == COLORLESS)
+    {
+        newarc(nfa, type, RAINBOW, from, to);
+        return;
+    }
+
+    /* Gotta do it the hard way.  Skip subcolors, pseudocolors, and "but" */
     for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
         if (!UNUSEDCOLOR(cd) && cd->sub != co && co != but &&
             !(cd->flags & PSEUDO))
@@ -1034,13 +1046,16 @@ rainbow(struct nfa *nfa,
 /*
  * colorcomplement - add arcs of complementary colors
  *
+ * We add arcs of all colors that are not pseudocolors and do not match
+ * any of the "of" state's PLAIN outarcs.
+ *
  * The calling sequence ought to be reconciled with cloneouts().
  */
 static void
 colorcomplement(struct nfa *nfa,
                 struct colormap *cm,
                 int type,
-                struct state *of,    /* complements of this guy's PLAIN outarcs */
+                struct state *of,
                 struct state *from,
                 struct state *to)
 {
@@ -1049,6 +1064,11 @@ colorcomplement(struct nfa *nfa,
     color        co;

     assert(of != from);
+
+    /* A RAINBOW arc matches all colors, making the complement empty */
+    if (findarc(of, PLAIN, RAINBOW) != NULL)
+        return;
+
     for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
         if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
             if (findarc(of, PLAIN, co) == NULL)
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index 7ed675f88a..ff98bfd694 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -271,6 +271,11 @@ destroystate(struct nfa *nfa,
  *
  * This function checks to make sure that no duplicate arcs are created.
  * In general we never want duplicates.
+ *
+ * However: in principle, a RAINBOW arc is redundant with any plain arc
+ * (unless that arc is for a pseudocolor).  But we don't try to recognize
+ * that redundancy, either here or in allied operations such as moveins().
+ * The pseudocolor consideration makes that more costly than it seems worth.
  */
 static void
 newarc(struct nfa *nfa,
@@ -1170,6 +1175,9 @@ copyouts(struct nfa *nfa,

 /*
  * cloneouts - copy out arcs of a state to another state pair, modifying type
+ *
+ * This is only used to convert PLAIN arcs to AHEAD/BEHIND arcs, which share
+ * the same interpretation of "co".  It wouldn't be sensible with LACONs.
  */
 static void
 cloneouts(struct nfa *nfa,
@@ -1181,9 +1189,13 @@ cloneouts(struct nfa *nfa,
     struct arc *a;

     assert(old != from);
+    assert(type == AHEAD || type == BEHIND);

     for (a = old->outs; a != NULL; a = a->outchain)
+    {
+        assert(a->type == PLAIN);
         newarc(nfa, type, a->co, from, to);
+    }
 }

 /*
@@ -1597,7 +1609,7 @@ pull(struct nfa *nfa,
     for (a = from->ins; a != NULL && !NISERR(); a = nexta)
     {
         nexta = a->inchain;
-        switch (combine(con, a))
+        switch (combine(nfa, con, a))
         {
             case INCOMPATIBLE:    /* destroy the arc */
                 freearc(nfa, a);
@@ -1624,6 +1636,10 @@ pull(struct nfa *nfa,
                 cparc(nfa, a, s, to);
                 freearc(nfa, a);
                 break;
+            case REPLACEARC:    /* replace arc's color */
+                newarc(nfa, a->type, con->co, a->from, to);
+                freearc(nfa, a);
+                break;
             default:
                 assert(NOTREACHED);
                 break;
@@ -1764,7 +1780,7 @@ push(struct nfa *nfa,
     for (a = to->outs; a != NULL && !NISERR(); a = nexta)
     {
         nexta = a->outchain;
-        switch (combine(con, a))
+        switch (combine(nfa, con, a))
         {
             case INCOMPATIBLE:    /* destroy the arc */
                 freearc(nfa, a);
@@ -1791,6 +1807,10 @@ push(struct nfa *nfa,
                 cparc(nfa, a, from, s);
                 freearc(nfa, a);
                 break;
+            case REPLACEARC:    /* replace arc's color */
+                newarc(nfa, a->type, con->co, from, a->to);
+                freearc(nfa, a);
+                break;
             default:
                 assert(NOTREACHED);
                 break;
@@ -1810,9 +1830,11 @@ push(struct nfa *nfa,
  * #def INCOMPATIBLE    1    // destroys arc
  * #def SATISFIED        2    // constraint satisfied
  * #def COMPATIBLE        3    // compatible but not satisfied yet
+ * #def REPLACEARC        4    // replace arc's color with constraint color
  */
 static int
-combine(struct arc *con,
+combine(struct nfa *nfa,
+        struct arc *con,
         struct arc *a)
 {
 #define  CA(ct,at)     (((ct)<<CHAR_BIT) | (at))
@@ -1827,14 +1849,46 @@ combine(struct arc *con,
         case CA(BEHIND, PLAIN):
             if (con->co == a->co)
                 return SATISFIED;
+            if (con->co == RAINBOW)
+            {
+                /* con is satisfied unless arc's color is a pseudocolor */
+                if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+                    return SATISFIED;
+            }
+            else if (a->co == RAINBOW)
+            {
+                /* con is incompatible if it's for a pseudocolor */
+                if (nfa->cm->cd[con->co].flags & PSEUDO)
+                    return INCOMPATIBLE;
+                /* otherwise, constraint constrains arc to be only its color */
+                return REPLACEARC;
+            }
             return INCOMPATIBLE;
             break;
         case CA('^', '^'):        /* collision, similar constraints */
         case CA('$', '$'):
-        case CA(AHEAD, AHEAD):
+            if (con->co == a->co)    /* true duplication */
+                return SATISFIED;
+            return INCOMPATIBLE;
+            break;
+        case CA(AHEAD, AHEAD):    /* collision, similar constraints */
         case CA(BEHIND, BEHIND):
             if (con->co == a->co)    /* true duplication */
                 return SATISFIED;
+            if (con->co == RAINBOW)
+            {
+                /* con is satisfied unless arc's color is a pseudocolor */
+                if (!(nfa->cm->cd[a->co].flags & PSEUDO))
+                    return SATISFIED;
+            }
+            else if (a->co == RAINBOW)
+            {
+                /* con is incompatible if it's for a pseudocolor */
+                if (nfa->cm->cd[con->co].flags & PSEUDO)
+                    return INCOMPATIBLE;
+                /* otherwise, constraint constrains arc to be only its color */
+                return REPLACEARC;
+            }
             return INCOMPATIBLE;
             break;
         case CA('^', BEHIND):    /* collision, dissimilar constraints */
@@ -2895,6 +2949,7 @@ compact(struct nfa *nfa,
                     break;
                 case LACON:
                     assert(s->no != cnfa->pre);
+                    assert(a->co >= 0);
                     ca->co = (color) (cnfa->ncolors + a->co);
                     ca->to = a->to->no;
                     ca++;
@@ -3068,13 +3123,22 @@ dumparc(struct arc *a,
     switch (a->type)
     {
         case PLAIN:
-            fprintf(f, "[%ld]", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, "[*]");
+            else
+                fprintf(f, "[%ld]", (long) a->co);
             break;
         case AHEAD:
-            fprintf(f, ">%ld>", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, ">*>");
+            else
+                fprintf(f, ">%ld>", (long) a->co);
             break;
         case BEHIND:
-            fprintf(f, "<%ld<", (long) a->co);
+            if (a->co == RAINBOW)
+                fprintf(f, "<*<");
+            else
+                fprintf(f, "<%ld<", (long) a->co);
             break;
         case LACON:
             fprintf(f, ":%ld:", (long) a->co);
@@ -3161,7 +3225,9 @@ dumpcstate(int st,
     pos = 1;
     for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++)
     {
-        if (ca->co < cnfa->ncolors)
+        if (ca->co == RAINBOW)
+            fprintf(f, "\t[*]->%d", ca->to);
+        else if (ca->co < cnfa->ncolors)
             fprintf(f, "\t[%ld]->%d", (long) ca->co, ca->to);
         else
             fprintf(f, "\t:%ld:->%d", (long) (ca->co - cnfa->ncolors), ca->to);
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index cd0caaa2d0..ae8dbe5819 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -158,7 +158,8 @@ static int    push(struct nfa *, struct arc *, struct state **);
 #define INCOMPATIBLE    1        /* destroys arc */
 #define SATISFIED    2            /* constraint satisfied */
 #define COMPATIBLE    3            /* compatible but not satisfied yet */
-static int    combine(struct arc *, struct arc *);
+#define REPLACEARC    4            /* replace arc's color with constraint color */
+static int    combine(struct nfa *nfa, struct arc *con, struct arc *a);
 static void fixempties(struct nfa *, FILE *);
 static struct state *emptyreachable(struct nfa *, struct state *,
                                     struct state *, struct arc **);
@@ -289,9 +290,11 @@ struct vars
 #define SBEGIN    'A'                /* beginning of string (even if not BOL) */
 #define SEND    'Z'                /* end of string (even if not EOL) */

-/* is an arc colored, and hence on a color chain? */
+/* is an arc colored, and hence should belong to a color chain? */
+/* the test on "co" eliminates RAINBOW arcs, which we don't bother to chain */
 #define COLORED(a) \
-    ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND)
+    ((a)->co >= 0 && \
+     ((a)->type == PLAIN || (a)->type == AHEAD || (a)->type == BEHIND))


 /* static function list */
@@ -1393,7 +1396,8 @@ bracket(struct vars *v,
  * cbracket - handle complemented bracket expression
  * We do it by calling bracket() with dummy endpoints, and then complementing
  * the result.  The alternative would be to invoke rainbow(), and then delete
- * arcs as the b.e. is seen... but that gets messy.
+ * arcs as the b.e. is seen... but that gets messy, and is really quite
+ * infeasible now that rainbow() just puts out one RAINBOW arc.
  */
 static void
 cbracket(struct vars *v,
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 5695e158a5..32be2592c5 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -612,6 +612,7 @@ miss(struct vars *v,
     unsigned    h;
     struct carc *ca;
     struct sset *p;
+    int            ispseudocolor;
     int            ispost;
     int            noprogress;
     int            gotstate;
@@ -643,13 +644,15 @@ miss(struct vars *v,
      */
     for (i = 0; i < d->wordsper; i++)
         d->work[i] = 0;            /* build new stateset bitmap in d->work */
+    ispseudocolor = d->cm->cd[co].flags & PSEUDO;
     ispost = 0;
     noprogress = 1;
     gotstate = 0;
     for (i = 0; i < d->nstates; i++)
         if (ISBSET(css->states, i))
             for (ca = cnfa->states[i]; ca->co != COLORLESS; ca++)
-                if (ca->co == co)
+                if (ca->co == co ||
+                    (ca->co == RAINBOW && !ispseudocolor))
                 {
                     BSET(d->work, ca->to);
                     gotstate = 1;
diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c
index d4f940b8c3..a493dbe88c 100644
--- a/src/backend/regex/regexport.c
+++ b/src/backend/regex/regexport.c
@@ -222,7 +222,8 @@ pg_reg_colorisend(const regex_t *regex, int co)
  * Get number of member chrs of color number "co".
  *
  * Note: we return -1 if the color number is invalid, or if it is a special
- * color (WHITE or a pseudocolor), or if the number of members is uncertain.
+ * color (WHITE, RAINBOW, or a pseudocolor), or if the number of members is
+ * uncertain.
  * Callers should not try to extract the members if -1 is returned.
  */
 int
@@ -233,7 +234,7 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
     assert(regex != NULL && regex->re_magic == REMAGIC);
     cm = &((struct guts *) regex->re_guts)->cmap;

-    if (co <= 0 || co > cm->max)    /* we reject 0 which is WHITE */
+    if (co <= 0 || co > cm->max)    /* <= 0 rejects WHITE and RAINBOW */
         return -1;
     if (cm->cd[co].flags & PSEUDO)    /* also pseudocolors (BOS etc) */
         return -1;
@@ -257,7 +258,7 @@ pg_reg_getnumcharacters(const regex_t *regex, int co)
  * whose length chars_len must be at least as long as indicated by
  * pg_reg_getnumcharacters(), else not all chars will be returned.
  *
- * Fetching the members of WHITE or a pseudocolor is not supported.
+ * Fetching the members of WHITE, RAINBOW, or a pseudocolor is not supported.
  *
  * Caution: this is a relatively expensive operation.
  */
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index 1d4593ac94..e2fbad7a8a 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -165,9 +165,13 @@ findprefix(struct cnfa *cnfa,
             /* We can ignore BOS/BOL arcs */
             if (ca->co == cnfa->bos[0] || ca->co == cnfa->bos[1])
                 continue;
-            /* ... but EOS/EOL arcs terminate the search, as do LACONs */
+
+            /*
+             * ... but EOS/EOL arcs terminate the search, as do RAINBOW arcs
+             * and LACONs
+             */
             if (ca->co == cnfa->eos[0] || ca->co == cnfa->eos[1] ||
-                ca->co >= cnfa->ncolors)
+                ca->co == RAINBOW || ca->co >= cnfa->ncolors)
             {
                 thiscolor = COLORLESS;
                 break;
diff --git a/src/include/regex/regexport.h b/src/include/regex/regexport.h
index e6209463f7..99c4fb854e 100644
--- a/src/include/regex/regexport.h
+++ b/src/include/regex/regexport.h
@@ -30,6 +30,10 @@

 #include "regex/regex.h"

+/* These macros must match corresponding ones in regguts.h: */
+#define COLOR_WHITE        0        /* color for chars not appearing in regex */
+#define COLOR_RAINBOW    (-2)    /* represents all colors except pseudocolors */
+
 /* information about one arc of a regex's NFA */
 typedef struct
 {
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 0a616562d0..6d39108319 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -130,11 +130,16 @@
 /*
  * As soon as possible, we map chrs into equivalence classes -- "colors" --
  * which are of much more manageable number.
+ *
+ * To further reduce the number of arcs in NFAs and DFAs, we also have a
+ * special RAINBOW "color" that can be assigned to an arc.  This is not a
+ * real color, in that it has no entry in color maps.
  */
 typedef short color;            /* colors of characters */

 #define MAX_COLOR    32767        /* max color (must fit in 'color' datatype) */
 #define COLORLESS    (-1)        /* impossible color */
+#define RAINBOW        (-2)        /* represents all colors except pseudocolors */
 #define WHITE        0            /* default color, parent of all others */
 /* Note: various places in the code know that WHITE is zero */

@@ -276,7 +281,7 @@ struct state;
 struct arc
 {
     int            type;            /* 0 if free, else an NFA arc type code */
-    color        co;
+    color        co;                /* color the arc matches (possibly RAINBOW) */
     struct state *from;            /* where it's from (and contained within) */
     struct state *to;            /* where it's to */
     struct arc *outchain;        /* link in *from's outs chain or free chain */
@@ -284,6 +289,7 @@ struct arc
 #define  freechain    outchain    /* we do not maintain "freechainRev" */
     struct arc *inchain;        /* link in *to's ins chain */
     struct arc *inchainRev;        /* back-link in *to's ins chain */
+    /* these fields are not used when co == RAINBOW: */
     struct arc *colorchain;        /* link in color's arc chain */
     struct arc *colorchainRev;    /* back-link in color's arc chain */
 };
@@ -344,6 +350,9 @@ struct nfa
  * Plain arcs just store the transition color number as "co".  LACON arcs
  * store the lookaround constraint number plus cnfa.ncolors as "co".  LACON
  * arcs can be distinguished from plain by testing for co >= cnfa.ncolors.
+ *
+ * Note that in a plain arc, "co" can be RAINBOW; since that's negative,
+ * it doesn't break the rule about how to recognize LACON arcs.
  */
 struct carc
 {
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index ff98bfd694..1a6c9ce183 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -65,6 +65,8 @@ newnfa(struct vars *v,
     nfa->v = v;
     nfa->bos[0] = nfa->bos[1] = COLORLESS;
     nfa->eos[0] = nfa->eos[1] = COLORLESS;
+    nfa->flags = 0;
+    nfa->minmatchall = nfa->maxmatchall = -1;
     nfa->parent = parent;        /* Precedes newfstate so parent is valid. */
     nfa->post = newfstate(nfa, '@');    /* number 0 */
     nfa->pre = newfstate(nfa, '>'); /* number 1 */
@@ -2875,8 +2877,14 @@ analyze(struct nfa *nfa)
     if (NISERR())
         return 0;

+    /* Detect whether NFA can't match anything */
     if (nfa->pre->outs == NULL)
         return REG_UIMPOSSIBLE;
+
+    /* Detect whether NFA matches all strings (possibly with length bounds) */
+    checkmatchall(nfa);
+
+    /* Detect whether NFA can possibly match a zero-length string */
     for (a = nfa->pre->outs; a != NULL; a = a->outchain)
         for (aa = a->to->outs; aa != NULL; aa = aa->outchain)
             if (aa->to == nfa->post)
@@ -2884,6 +2892,282 @@ analyze(struct nfa *nfa)
     return 0;
 }

+/*
+ * checkmatchall - does the NFA represent no more than a string length test?
+ *
+ * If so, set nfa->minmatchall and nfa->maxmatchall correctly (they are -1
+ * to begin with) and set the MATCHALL bit in nfa->flags.
+ *
+ * To succeed, we require all arcs to be PLAIN RAINBOW arcs, except for those
+ * for pseudocolors (i.e., BOS/BOL/EOS/EOL).  We must be able to reach the
+ * post state via RAINBOW arcs, and if there are any loops in the graph, they
+ * must be loop-to-self arcs, ensuring that each loop iteration consumes
+ * exactly one character.  (Longer loops are problematic because they create
+ * non-consecutive possible match lengths; we have no good way to represent
+ * that situation for lengths beyond the DUPINF limit.)
+ *
+ * Pseudocolor arcs complicate things a little.  We know that they can only
+ * appear as pre-state outarcs (for BOS/BOL) or post-state inarcs (for
+ * EOS/EOL).  There, they must exactly replicate the parallel RAINBOW arcs,
+ * e.g. if the pre state has one RAINBOW outarc to state 2, it must have BOS
+ * and BOL outarcs to state 2, and no others.  Missing or extra pseudocolor
+ * arcs can occur, meaning that the NFA involves some constraint on the
+ * adjacent characters, which makes it not a matchall NFA.
+ */
+static void
+checkmatchall(struct nfa *nfa)
+{
+    bool        hasmatch[DUPINF + 1];
+    int            minmatch,
+                maxmatch,
+                morematch;
+
+    /*
+     * hasmatch[i] will be set true if a match of length i is feasible, for i
+     * from 0 to DUPINF-1.  hasmatch[DUPINF] will be set true if every match
+     * length of DUPINF or more is feasible.
+     */
+    memset(hasmatch, 0, sizeof(hasmatch));
+
+    /*
+     * Recursively search the graph for all-RAINBOW paths to the "post" state,
+     * starting at the "pre" state.  The -1 initial depth accounts for the
+     * fact that transitions out of the "pre" state are not part of the
+     * matched string.  We likewise don't count the final transition to the
+     * "post" state as part of the match length.  (But we still insist that
+     * those transitions have RAINBOW arcs, otherwise there are lookbehind or
+     * lookahead constraints at the start/end of the pattern.)
+     */
+    if (!checkmatchall_recurse(nfa, nfa->pre, false, -1, hasmatch))
+        return;
+
+    /*
+     * We found some all-RAINBOW paths, and not anything that we couldn't
+     * handle.  Now verify that pseudocolor arcs adjacent to the pre and post
+     * states match the RAINBOW arcs there.  (We could do this while
+     * recursing, but it's expensive and unlikely to fail, so do it last.)
+     */
+    if (!check_out_colors_match(nfa->pre, RAINBOW, nfa->bos[0]) ||
+        !check_out_colors_match(nfa->pre, nfa->bos[0], RAINBOW) ||
+        !check_out_colors_match(nfa->pre, RAINBOW, nfa->bos[1]) ||
+        !check_out_colors_match(nfa->pre, nfa->bos[1], RAINBOW))
+        return;
+    if (!check_in_colors_match(nfa->post, RAINBOW, nfa->eos[0]) ||
+        !check_in_colors_match(nfa->post, nfa->eos[0], RAINBOW) ||
+        !check_in_colors_match(nfa->post, RAINBOW, nfa->eos[1]) ||
+        !check_in_colors_match(nfa->post, nfa->eos[1], RAINBOW))
+        return;
+
+    /*
+     * hasmatch[] now represents the set of possible match lengths; but we
+     * want to reduce that to a min and max value, because it doesn't seem
+     * worth complicating regexec.c to deal with nonconsecutive possible match
+     * lengths.  Find min and max of first run of lengths, then verify there
+     * are no nonconsecutive lengths.
+     */
+    for (minmatch = 0; minmatch <= DUPINF; minmatch++)
+    {
+        if (hasmatch[minmatch])
+            break;
+    }
+    assert(minmatch <= DUPINF); /* else checkmatchall_recurse lied */
+    for (maxmatch = minmatch; maxmatch < DUPINF; maxmatch++)
+    {
+        if (!hasmatch[maxmatch + 1])
+            break;
+    }
+    for (morematch = maxmatch + 1; morematch <= DUPINF; morematch++)
+    {
+        if (hasmatch[morematch])
+            return;                /* fail, there are nonconsecutive lengths */
+    }
+
+    /* Success, so record the info */
+    nfa->minmatchall = minmatch;
+    nfa->maxmatchall = maxmatch;
+    nfa->flags |= MATCHALL;
+}
+
+/*
+ * checkmatchall_recurse - recursive search for checkmatchall
+ *
+ * s is the current state
+ * foundloop is true if any predecessor state has a loop-to-self
+ * depth is the current recursion depth (starting at -1)
+ * hasmatch[] is the output area for recording feasible match lengths
+ *
+ * We return true if there is at least one all-RAINBOW path to the "post"
+ * state and no non-matchall paths; otherwise false.  Note we assume that
+ * any dead-end paths have already been removed, else we might return
+ * false unnecessarily.
+ */
+static bool
+checkmatchall_recurse(struct nfa *nfa, struct state *s,
+                      bool foundloop, int depth,
+                      bool *hasmatch)
+{
+    bool        result = false;
+    struct arc *a;
+
+    /*
+     * Since this is recursive, it could be driven to stack overflow.  But we
+     * need not treat that as a hard failure; just deem the NFA non-matchall.
+     */
+    if (STACK_TOO_DEEP(nfa->v->re))
+        return false;
+
+    /*
+     * Likewise, if we get to a depth too large to represent correctly in
+     * maxmatchall, fail quietly.
+     */
+    if (depth >= DUPINF)
+        return false;
+
+    /*
+     * Scan the outarcs to detect cases we can't handle, and to see if there
+     * is a loop-to-self here.  We need to know about any such loop before we
+     * recurse, so it's hard to avoid making two passes over the outarcs.  In
+     * any case, checking for showstoppers before we recurse is probably best.
+     */
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->type != PLAIN)
+            return false;        /* any LACONs make it non-matchall */
+        if (a->co != RAINBOW)
+        {
+            if (nfa->cm->cd[a->co].flags & PSEUDO)
+            {
+                /*
+                 * Pseudocolor arc: verify it's in a valid place (this seems
+                 * quite unlikely to fail, but let's be sure).
+                 */
+                if (s == nfa->pre &&
+                    (a->co == nfa->bos[0] || a->co == nfa->bos[1]))
+                     /* okay BOS/BOL arc */ ;
+                else if (a->to == nfa->post &&
+                         (a->co == nfa->eos[0] || a->co == nfa->eos[1]))
+                     /* okay EOS/EOL arc */ ;
+                else
+                    return false;    /* unexpected pseudocolor arc */
+                /* We'll finish checking these arcs after the recursion */
+                continue;
+            }
+            return false;        /* any other color makes it non-matchall */
+        }
+        if (a->to == s)
+        {
+            /*
+             * We found a cycle of length 1, so remember that to pass down to
+             * successor states.  (It doesn't matter if there was also such a
+             * loop at a predecessor state.)
+             */
+            foundloop = true;
+        }
+        else if (a->to->tmp)
+        {
+            /* We found a cycle of length > 1, so fail. */
+            return false;
+        }
+    }
+
+    /* We need to recurse, so mark state as under consideration */
+    assert(s->tmp == NULL);
+    s->tmp = s;
+
+    for (a = s->outs; a != NULL; a = a->outchain)
+    {
+        if (a->co != RAINBOW)
+            continue;            /* ignore pseudocolor arcs */
+        if (a->to == nfa->post)
+        {
+            /* We found an all-RAINBOW path to the post state */
+            result = true;
+            /* Record potential match lengths */
+            assert(depth >= 0);
+            hasmatch[depth] = true;
+            if (foundloop)
+            {
+                /* A predecessor loop makes all larger lengths match, too */
+                int            i;
+
+                for (i = depth + 1; i <= DUPINF; i++)
+                    hasmatch[i] = true;
+            }
+        }
+        else if (a->to != s)
+        {
+            /* This is a new path forward; recurse to investigate */
+            result = checkmatchall_recurse(nfa, a->to,
+                                           foundloop, depth + 1,
+                                           hasmatch);
+            /* Fail if any recursive path fails */
+            if (!result)
+                break;
+        }
+    }
+
+    s->tmp = NULL;
+    return result;
+}
+
+/*
+ * check_out_colors_match - subroutine for checkmatchall
+ *
+ * Check if every s outarc of color co1 has a matching outarc of color co2.
+ * (checkmatchall_recurse already verified that all of the outarcs are PLAIN,
+ * so we need not examine arc types here.  Also, since it verified that there
+ * are only RAINBOW and pseudocolor arcs, there shouldn't be enough arcs for
+ * this brute-force O(N^2) implementation to cause problems.)
+ */
+static bool
+check_out_colors_match(struct state *s, color co1, color co2)
+{
+    struct arc *a1;
+    struct arc *a2;
+
+    for (a1 = s->outs; a1 != NULL; a1 = a1->outchain)
+    {
+        if (a1->co != co1)
+            continue;
+        for (a2 = s->outs; a2 != NULL; a2 = a2->outchain)
+        {
+            if (a2->co == co2 && a2->to == a1->to)
+                break;
+        }
+        if (a2 == NULL)
+            return false;
+    }
+    return true;
+}
+
+/*
+ * check_in_colors_match - subroutine for checkmatchall
+ *
+ * Check if every s inarc of color co1 has a matching inarc of color co2.
+ * (For paranoia's sake, ignore any non-PLAIN arcs here.  But we're still
+ * not expecting very many arcs.)
+ */
+static bool
+check_in_colors_match(struct state *s, color co1, color co2)
+{
+    struct arc *a1;
+    struct arc *a2;
+
+    for (a1 = s->ins; a1 != NULL; a1 = a1->inchain)
+    {
+        if (a1->type != PLAIN || a1->co != co1)
+            continue;
+        for (a2 = s->ins; a2 != NULL; a2 = a2->inchain)
+        {
+            if (a2->type == PLAIN && a2->co == co2 && a2->from == a1->from)
+                break;
+        }
+        if (a2 == NULL)
+            return false;
+    }
+    return true;
+}
+
 /*
  * compact - construct the compact representation of an NFA
  */
@@ -2930,7 +3214,9 @@ compact(struct nfa *nfa,
     cnfa->eos[0] = nfa->eos[0];
     cnfa->eos[1] = nfa->eos[1];
     cnfa->ncolors = maxcolor(nfa->cm) + 1;
-    cnfa->flags = 0;
+    cnfa->flags = nfa->flags;
+    cnfa->minmatchall = nfa->minmatchall;
+    cnfa->maxmatchall = nfa->maxmatchall;

     ca = cnfa->arcs;
     for (s = nfa->states; s != NULL; s = s->next)
@@ -3034,6 +3320,11 @@ dumpnfa(struct nfa *nfa,
         fprintf(f, ", eos [%ld]", (long) nfa->eos[0]);
     if (nfa->eos[1] != COLORLESS)
         fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
+    if (nfa->flags & HASLACONS)
+        fprintf(f, ", haslacons");
+    if (nfa->flags & MATCHALL)
+        fprintf(f, ", minmatchall %d, maxmatchall %d",
+                nfa->minmatchall, nfa->maxmatchall);
     fprintf(f, "\n");
     for (s = nfa->states; s != NULL; s = s->next)
     {
@@ -3201,6 +3492,9 @@ dumpcnfa(struct cnfa *cnfa,
         fprintf(f, ", eol [%ld]", (long) cnfa->eos[1]);
     if (cnfa->flags & HASLACONS)
         fprintf(f, ", haslacons");
+    if (cnfa->flags & MATCHALL)
+        fprintf(f, ", minmatchall %d, maxmatchall %d",
+                cnfa->minmatchall, cnfa->maxmatchall);
     fprintf(f, "\n");
     for (st = 0; st < cnfa->nstates; st++)
         dumpcstate(st, cnfa, f);
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index ae8dbe5819..39e837adc2 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -175,6 +175,11 @@ static void cleanup(struct nfa *);
 static void markreachable(struct nfa *, struct state *, struct state *, struct state *);
 static void markcanreach(struct nfa *, struct state *, struct state *, struct state *);
 static long analyze(struct nfa *);
+static void checkmatchall(struct nfa *);
+static bool checkmatchall_recurse(struct nfa *, struct state *,
+                                  bool, int, bool *);
+static bool check_out_colors_match(struct state *, color, color);
+static bool check_in_colors_match(struct state *, color, color);
 static void compact(struct nfa *, struct cnfa *);
 static void carcsort(struct carc *, size_t);
 static int    carc_cmp(const void *, const void *);
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 32be2592c5..89d162ed6a 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -58,6 +58,29 @@ longest(struct vars *v,
     if (hitstopp != NULL)
         *hitstopp = 0;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = stop - start;
+        size_t        maxmatchall = d->cnfa->maxmatchall;
+
+        if (nchr < d->cnfa->minmatchall)
+            return NULL;
+        if (maxmatchall == DUPINF)
+        {
+            if (stop == v->stop && hitstopp != NULL)
+                *hitstopp = 1;
+        }
+        else
+        {
+            if (stop == v->stop && nchr <= maxmatchall + 1 && hitstopp != NULL)
+                *hitstopp = 1;
+            if (nchr > maxmatchall)
+                return start + maxmatchall;
+        }
+        return stop;
+    }
+
     /* initialize */
     css = initialize(v, d, start);
     if (css == NULL)
@@ -187,6 +210,24 @@ shortest(struct vars *v,
     if (hitstopp != NULL)
         *hitstopp = 0;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = min - start;
+
+        if (d->cnfa->maxmatchall != DUPINF &&
+            nchr > d->cnfa->maxmatchall)
+            return NULL;
+        if ((max - start) < d->cnfa->minmatchall)
+            return NULL;
+        if (nchr < d->cnfa->minmatchall)
+            min = start + d->cnfa->minmatchall;
+        if (coldp != NULL)
+            *coldp = start;
+        /* there is no case where we should set *hitstopp */
+        return min;
+    }
+
     /* initialize */
     css = initialize(v, d, start);
     if (css == NULL)
@@ -312,6 +353,22 @@ matchuntil(struct vars *v,
     struct sset *ss;
     struct colormap *cm = d->cm;

+    /* fast path for matchall NFAs */
+    if (d->cnfa->flags & MATCHALL)
+    {
+        size_t        nchr = probe - v->start;
+
+        /*
+         * It might seem that we should check maxmatchall too, but the .* at
+         * the front of the pattern absorbs any extra characters (and it was
+         * tacked on *after* computing minmatchall/maxmatchall).  Thus, we
+         * should match if there are at least minmatchall characters.
+         */
+        if (nchr < d->cnfa->minmatchall)
+            return 0;
+        return 1;
+    }
+
     /* initialize and startup, or restart, if necessary */
     if (cp == NULL || cp > probe)
     {
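
For readers who want to check the length arithmetic in the longest() fast path, here is a reduced sketch that works on character counts instead of chr pointers.  It is an illustration under assumptions, not the patch's code: longest_matchall_len() and UNBOUNDED are hypothetical names, UNBOUNDED stands in for DUPINF, and the *hitstopp bookkeeping is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define UNBOUNDED 256			/* assumed stand-in for DUPINF */

/*
 * Hypothetical helper: given nchr available characters and the matchall
 * length bounds, return the length of the longest match, or -1 if there
 * is no match.
 */
static long
longest_matchall_len(size_t nchr, int minmatchall, int maxmatchall)
{
	assert(minmatchall >= 0 && maxmatchall >= minmatchall);

	if (nchr < (size_t) minmatchall)
		return -1;				/* too short to match at all */
	if (maxmatchall != UNBOUNDED && nchr > (size_t) maxmatchall)
		return maxmatchall;		/* clamp to the upper bound */
	return (long) nchr;			/* everything available matches */
}
```

The real function returns "start + maxmatchall" or "stop" where this sketch returns a length, and NULL where it returns -1.
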
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index e2fbad7a8a..ec435b6f5f 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -77,6 +77,10 @@ pg_regprefix(regex_t *re,
     assert(g->tree != NULL);
     cnfa = &g->tree->cnfa;

+    /* matchall NFAs never have a fixed prefix */
+    if (cnfa->flags & MATCHALL)
+        return REG_NOMATCH;
+
     /*
      * Since a correct NFA should never contain any exit-free loops, it should
      * not be possible for our traversal to return to a previously visited NFA
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 6d39108319..82e761bfe5 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -331,6 +331,9 @@ struct nfa
     struct colormap *cm;        /* the color map */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
     color        eos[2];            /* colors, if any, assigned to EOS and EOL */
+    int            flags;            /* flags to pass forward to cNFA */
+    int            minmatchall;    /* min number of chrs to match, if matchall */
+    int            maxmatchall;    /* max number of chrs to match, or DUPINF */
     struct vars *v;                /* simplifies compile error reporting */
     struct nfa *parent;            /* parent NFA, if any */
 };
@@ -353,6 +356,14 @@ struct nfa
  *
  * Note that in a plain arc, "co" can be RAINBOW; since that's negative,
  * it doesn't break the rule about how to recognize LACON arcs.
+ *
+ * We have special markings for "trivial" NFAs that can match any string
+ * (possibly with limits on the number of characters therein).  In such a
+ * case, flags & MATCHALL is set (and HASLACONS can't be set).  Then the
+ * fields minmatchall and maxmatchall give the minimum and maximum numbers
+ * of characters to match.  For example, ".*" produces minmatchall = 0
+ * and maxmatchall = DUPINF, while ".+" produces minmatchall = 1 and
+ * maxmatchall = DUPINF.
  */
 struct carc
 {
@@ -366,6 +377,7 @@ struct cnfa
     int            ncolors;        /* number of colors (max color in use + 1) */
     int            flags;
 #define  HASLACONS    01            /* uses lookaround constraints */
+#define  MATCHALL    02            /* matches all strings of a range of lengths */
     int            pre;            /* setup state number */
     int            post;            /* teardown state number */
     color        bos[2];            /* colors, if any, assigned to BOS and BOL */
@@ -375,6 +387,9 @@ struct cnfa
     struct carc **states;        /* vector of pointers to outarc lists */
     /* states[n] are pointers into a single malloc'd array of arcs */
     struct carc *arcs;            /* the area for the lists */
+    /* these fields are used only in a MATCHALL NFA (else they're -1): */
+    int            minmatchall;    /* min number of chrs to match */
+    int            maxmatchall;    /* max number of chrs to match, or DUPINF */
 };

 /*
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 39e837adc2..0182e02db1 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -1123,14 +1123,24 @@ parseqatom(struct vars *v,
     t->left = atom;
     atomp = &t->left;
 
-    /* here we should recurse... but we must postpone that to the end */
+    /*
+     * Here we should recurse to fill t->right ... but we must postpone that
+     * to the end.
+     */
 
-    /* split top into prefix and remaining */
+    /*
+     * Convert top node to a concatenation of the prefix (top->left, covering
+     * whatever we parsed previously) and remaining (t).  Note that the prefix
+     * could be empty, in which case this concatenation node is unnecessary.
+     * To keep things simple, we operate in a general way for now, and get rid
+     * of unnecessary subres below.
+     */
     assert(top->op == '=' && top->left == NULL && top->right == NULL);
     top->left = subre(v, '=', top->flags, top->begin, lp);
     NOERR();
     top->op = '.';
     top->right = t;
+    /* top->flags will get updated later */
 
     /* if it's a backref, now is the time to replicate the subNFA */
     if (atomtype == BACKREF)
@@ -1220,16 +1230,75 @@ parseqatom(struct vars *v,
     /* and finally, look after that postponed recursion */
     t = top->right;
     if (!(SEE('|') || SEE(stopper) || SEE(EOS)))
+    {
+        /* parse all the rest of the branch, and insert in t->right */
         t->right = parsebranch(v, stopper, type, s2, rp, 1);
+        NOERR();
+        assert(SEE('|') || SEE(stopper) || SEE(EOS));
+
+        /* here's the promised update of the flags */
+        t->flags |= COMBINE(t->flags, t->right->flags);
+        top->flags |= COMBINE(top->flags, t->flags);
+
+        /*
+         * At this point both top and t are concatenation (op == '.') subres,
+         * and we have top->left = prefix of branch, top->right = t, t->left =
+         * messy atom (with quantification superstructure if needed), t->right
+         * = rest of branch.
+         *
+         * If the messy atom was the first thing in the branch, then top->left
+         * is vacuous and we can get rid of one level of concatenation.  Since
+         * the caller is holding a pointer to the top node, we can't remove
+         * that node; but we're allowed to change its properties.
+         */
+        assert(top->left->op == '=');
+        if (top->left->begin == top->left->end)
+        {
+            assert(!MESSY(top->left->flags));
+            freesubre(v, top->left);
+            top->left = t->left;
+            top->right = t->right;
+            t->left = t->right = NULL;
+            freesubre(v, t);
+        }
+    }
     else
     {
+        /*
+         * There's nothing left in the branch, so we don't need the second
+         * concatenation node 't'.  Just link s2 straight to rp.
+         */
         EMPTYARC(s2, rp);
-        t->right = subre(v, '=', 0, s2, rp);
+        top->right = t->left;
+        top->flags |= COMBINE(top->flags, top->right->flags);
+        t->left = t->right = NULL;
+        freesubre(v, t);
+
+        /*
+         * Again, it could be that top->left is vacuous (if the messy atom was
+         * in fact the only thing in the branch).  In that case we need no
+         * concatenation at all; just replace top with top->right.
+         */
+        assert(top->left->op == '=');
+        if (top->left->begin == top->left->end)
+        {
+            assert(!MESSY(top->left->flags));
+            freesubre(v, top->left);
+            t = top->right;
+            top->op = t->op;
+            top->flags = t->flags;
+            top->id = t->id;
+            top->subno = t->subno;
+            top->min = t->min;
+            top->max = t->max;
+            top->left = t->left;
+            top->right = t->right;
+            top->begin = t->begin;
+            top->end = t->end;
+            t->left = t->right = NULL;
+            freesubre(v, t);
+        }
     }
-    NOERR();
-    assert(SEE('|') || SEE(stopper) || SEE(EOS));
-    t->flags |= COMBINE(t->flags, t->right->flags);
-    top->flags |= COMBINE(top->flags, t->flags);
 }
 
 /*
diff --git a/src/backend/regex/README b/src/backend/regex/README
index a83ab5074d..cafeb3dffb 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -129,9 +129,9 @@ If not, we can reject the match immediately without iterating through many
 possibilities.

 As an example, consider the regex "(a[bc]+)\1".  The compiled
-representation will have a top-level concatenation subre node.  Its left
+representation will have a top-level concatenation subre node.  Its first
 child is a capture node, and the child of that is a plain DFA node for
-"a[bc]+".  The concatenation's right child is a backref node for \1.
+"a[bc]+".  The concatenation's second child is a backref node for \1.
 The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
 where the backref has been replaced by a copy of the DFA for its referent
 expression.  When executed, the concatenation node will have to search for
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 0182e02db1..4d483e7e53 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -58,6 +58,7 @@ static void processlacon(struct vars *, struct state *, struct state *, int,
                          struct state *, struct state *);
 static struct subre *subre(struct vars *, int, int, struct state *, struct state *);
 static void freesubre(struct vars *, struct subre *);
+static void freesubreandsiblings(struct vars *, struct subre *);
 static void freesrnode(struct vars *, struct subre *);
 static void optst(struct vars *, struct subre *);
 static int    numst(struct subre *, int);
@@ -652,8 +653,8 @@ makesearch(struct vars *v,
  * parse - parse an RE
  *
  * This is actually just the top level, which parses a bunch of branches
- * tied together with '|'.  They appear in the tree as the left children
- * of a chain of '|' subres.
+ * tied together with '|'.  If there's more than one, they appear in the
+ * tree as the children of a '|' subre.
  */
 static struct subre *
 parse(struct vars *v,
@@ -662,41 +663,34 @@ parse(struct vars *v,
       struct state *init,        /* initial state */
       struct state *final)        /* final state */
 {
-    struct state *left;            /* scaffolding for branch */
-    struct state *right;
     struct subre *branches;        /* top level */
-    struct subre *branch;        /* current branch */
-    struct subre *t;            /* temporary */
-    int            firstbranch;    /* is this the first branch? */
+    struct subre *lastbranch;    /* latest branch */

     assert(stopper == ')' || stopper == EOS);

     branches = subre(v, '|', LONGER, init, final);
     NOERRN();
-    branch = branches;
-    firstbranch = 1;
+    lastbranch = NULL;
     do
     {                            /* a branch */
-        if (!firstbranch)
-        {
-            /* need a place to hang it */
-            branch->right = subre(v, '|', LONGER, init, final);
-            NOERRN();
-            branch = branch->right;
-        }
-        firstbranch = 0;
+        struct subre *branch;
+        struct state *left;        /* scaffolding for branch */
+        struct state *right;
+
         left = newstate(v->nfa);
         right = newstate(v->nfa);
         NOERRN();
         EMPTYARC(init, left);
         EMPTYARC(right, final);
         NOERRN();
-        branch->left = parsebranch(v, stopper, type, left, right, 0);
+        branch = parsebranch(v, stopper, type, left, right, 0);
         NOERRN();
-        branch->flags |= UP(branch->flags | branch->left->flags);
-        if ((branch->flags & ~branches->flags) != 0)    /* new flags */
-            for (t = branches; t != branch; t = t->right)
-                t->flags |= branch->flags;
+        if (lastbranch)
+            lastbranch->sibling = branch;
+        else
+            branches->child = branch;
+        branches->flags |= UP(branches->flags | branch->flags);
+        lastbranch = branch;
     } while (EAT('|'));
     assert(SEE(stopper) || SEE(EOS));

@@ -707,20 +701,16 @@ parse(struct vars *v,
     }

     /* optimize out simple cases */
-    if (branch == branches)
+    if (lastbranch == branches->child)
     {                            /* only one branch */
-        assert(branch->right == NULL);
-        t = branch->left;
-        branch->left = NULL;
-        freesubre(v, branches);
-        branches = t;
+        assert(lastbranch->sibling == NULL);
+        freesrnode(v, branches);
+        branches = lastbranch;
     }
     else if (!MESSY(branches->flags))
     {                            /* no interesting innards */
-        freesubre(v, branches->left);
-        branches->left = NULL;
-        freesubre(v, branches->right);
-        branches->right = NULL;
+        freesubreandsiblings(v, branches->child);
+        branches->child = NULL;
         branches->op = '=';
     }

@@ -972,7 +962,7 @@ parseqatom(struct vars *v,
                 t = subre(v, '(', atom->flags | CAP, lp, rp);
                 NOERR();
                 t->subno = subno;
-                t->left = atom;
+                t->child = atom;
                 atom = t;
             }
             /* postpone everything else pending possible {0} */
@@ -1120,26 +1110,27 @@ parseqatom(struct vars *v,
     /* break remaining subRE into x{...} and what follows */
     t = subre(v, '.', COMBINE(qprefer, atom->flags), lp, rp);
     NOERR();
-    t->left = atom;
-    atomp = &t->left;
+    t->child = atom;
+    atomp = &t->child;

     /*
-     * Here we should recurse to fill t->right ... but we must postpone that
-     * to the end.
+     * Here we should recurse to fill t->child->sibling ... but we must
+     * postpone that to the end.  One reason is that t->child may be replaced
+     * below, and we don't want to worry about its sibling link.
      */

     /*
-     * Convert top node to a concatenation of the prefix (top->left, covering
+     * Convert top node to a concatenation of the prefix (top->child, covering
      * whatever we parsed previously) and remaining (t).  Note that the prefix
      * could be empty, in which case this concatenation node is unnecessary.
      * To keep things simple, we operate in a general way for now, and get rid
      * of unnecessary subres below.
      */
-    assert(top->op == '=' && top->left == NULL && top->right == NULL);
-    top->left = subre(v, '=', top->flags, top->begin, lp);
+    assert(top->op == '=' && top->child == NULL);
+    top->child = subre(v, '=', top->flags, top->begin, lp);
     NOERR();
     top->op = '.';
-    top->right = t;
+    top->child->sibling = t;
     /* top->flags will get updated later */

     /* if it's a backref, now is the time to replicate the subNFA */
@@ -1201,9 +1192,9 @@ parseqatom(struct vars *v,
         f = COMBINE(qprefer, atom->flags);
         t = subre(v, '.', f, s, atom->end); /* prefix and atom */
         NOERR();
-        t->left = subre(v, '=', PREF(f), s, atom->begin);
+        t->child = subre(v, '=', PREF(f), s, atom->begin);
         NOERR();
-        t->right = atom;
+        t->child->sibling = atom;
         *atomp = t;
         /* rest of branch can be strung starting from atom->end */
         s2 = atom->end;
@@ -1222,44 +1213,43 @@ parseqatom(struct vars *v,
         NOERR();
         t->min = (short) m;
         t->max = (short) n;
-        t->left = atom;
+        t->child = atom;
         *atomp = t;
         /* rest of branch is to be strung from iteration's end state */
     }

     /* and finally, look after that postponed recursion */
-    t = top->right;
+    t = top->child->sibling;
     if (!(SEE('|') || SEE(stopper) || SEE(EOS)))
     {
-        /* parse all the rest of the branch, and insert in t->right */
-        t->right = parsebranch(v, stopper, type, s2, rp, 1);
+        /* parse all the rest of the branch, and insert in t->child->sibling */
+        t->child->sibling = parsebranch(v, stopper, type, s2, rp, 1);
         NOERR();
         assert(SEE('|') || SEE(stopper) || SEE(EOS));

         /* here's the promised update of the flags */
-        t->flags |= COMBINE(t->flags, t->right->flags);
+        t->flags |= COMBINE(t->flags, t->child->sibling->flags);
         top->flags |= COMBINE(top->flags, t->flags);

         /*
          * At this point both top and t are concatenation (op == '.') subres,
-         * and we have top->left = prefix of branch, top->right = t, t->left =
-         * messy atom (with quantification superstructure if needed), t->right
-         * = rest of branch.
+         * and we have top->child = prefix of branch, top->child->sibling = t,
+         * t->child = messy atom (with quantification superstructure if
+         * needed), t->child->sibling = rest of branch.
          *
-         * If the messy atom was the first thing in the branch, then top->left
-         * is vacuous and we can get rid of one level of concatenation.  Since
-         * the caller is holding a pointer to the top node, we can't remove
-         * that node; but we're allowed to change its properties.
+         * If the messy atom was the first thing in the branch, then
+         * top->child is vacuous and we can get rid of one level of
+         * concatenation.  Since the caller is holding a pointer to the top
+         * node, we can't remove that node; but we're allowed to change its
+         * properties.
          */
-        assert(top->left->op == '=');
-        if (top->left->begin == top->left->end)
+        assert(top->child->op == '=');
+        if (top->child->begin == top->child->end)
         {
-            assert(!MESSY(top->left->flags));
-            freesubre(v, top->left);
-            top->left = t->left;
-            top->right = t->right;
-            t->left = t->right = NULL;
-            freesubre(v, t);
+            assert(!MESSY(top->child->flags));
+            freesubre(v, top->child);
+            top->child = t->child;
+            freesrnode(v, t);
         }
     }
     else
@@ -1269,34 +1259,31 @@ parseqatom(struct vars *v,
          * concatenation node 't'.  Just link s2 straight to rp.
          */
         EMPTYARC(s2, rp);
-        top->right = t->left;
-        top->flags |= COMBINE(top->flags, top->right->flags);
-        t->left = t->right = NULL;
-        freesubre(v, t);
+        top->child->sibling = t->child;
+        top->flags |= COMBINE(top->flags, top->child->sibling->flags);
+        freesrnode(v, t);

         /*
-         * Again, it could be that top->left is vacuous (if the messy atom was
-         * in fact the only thing in the branch).  In that case we need no
-         * concatenation at all; just replace top with top->right.
+         * Again, it could be that top->child is vacuous (if the messy atom
+         * was in fact the only thing in the branch).  In that case we need no
+         * concatenation at all; just replace top with top->child->sibling.
          */
-        assert(top->left->op == '=');
-        if (top->left->begin == top->left->end)
+        assert(top->child->op == '=');
+        if (top->child->begin == top->child->end)
         {
-            assert(!MESSY(top->left->flags));
-            freesubre(v, top->left);
-            t = top->right;
+            assert(!MESSY(top->child->flags));
+            t = top->child->sibling;
+            freesubre(v, top->child);
             top->op = t->op;
             top->flags = t->flags;
             top->id = t->id;
             top->subno = t->subno;
             top->min = t->min;
             top->max = t->max;
-            top->left = t->left;
-            top->right = t->right;
+            top->child = t->child;
             top->begin = t->begin;
             top->end = t->end;
-            t->left = t->right = NULL;
-            freesubre(v, t);
+            freesrnode(v, t);
         }
     }
 }
@@ -1786,7 +1773,7 @@ subre(struct vars *v,
     }

     if (ret != NULL)
-        v->treefree = ret->left;
+        v->treefree = ret->child;
     else
     {
         ret = (struct subre *) MALLOC(sizeof(struct subre));
@@ -1806,8 +1793,8 @@ subre(struct vars *v,
     ret->id = 0;                /* will be assigned later */
     ret->subno = 0;
     ret->min = ret->max = 1;
-    ret->left = NULL;
-    ret->right = NULL;
+    ret->child = NULL;
+    ret->sibling = NULL;
     ret->begin = begin;
     ret->end = end;
     ZAPCNFA(ret->cnfa);
@@ -1817,6 +1804,9 @@ subre(struct vars *v,

 /*
  * freesubre - free a subRE subtree
+ *
+ * This frees child node(s) of the given subRE too,
+ * but not its siblings.
  */
 static void
 freesubre(struct vars *v,        /* might be NULL */
@@ -1825,14 +1815,31 @@ freesubre(struct vars *v,        /* might be NULL */
     if (sr == NULL)
         return;

-    if (sr->left != NULL)
-        freesubre(v, sr->left);
-    if (sr->right != NULL)
-        freesubre(v, sr->right);
+    if (sr->child != NULL)
+        freesubreandsiblings(v, sr->child);

     freesrnode(v, sr);
 }

+/*
+ * freesubreandsiblings - free a subRE subtree
+ *
+ * This frees child node(s) of the given subRE too,
+ * as well as any following siblings.
+ */
+static void
+freesubreandsiblings(struct vars *v,    /* might be NULL */
+                     struct subre *sr)
+{
+    while (sr != NULL)
+    {
+        struct subre *next = sr->sibling;
+
+        freesubre(v, sr);
+        sr = next;
+    }
+}
+
 /*
  * freesrnode - free one node in a subRE subtree
  */
@@ -1850,7 +1857,7 @@ freesrnode(struct vars *v,        /* might be NULL */
     if (v != NULL && v->treechain != NULL)
     {
         /* we're still parsing, maybe we can reuse the subre */
-        sr->left = v->treefree;
+        sr->child = v->treefree;
         v->treefree = sr;
     }
     else
@@ -1881,15 +1888,14 @@ numst(struct subre *t,
       int start)                /* starting point for subtree numbers */
 {
     int            i;
+    struct subre *t2;

     assert(t != NULL);

     i = start;
     t->id = (short) i++;
-    if (t->left != NULL)
-        i = numst(t->left, i);
-    if (t->right != NULL)
-        i = numst(t->right, i);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        i = numst(t2, i);
     return i;
 }

@@ -1913,13 +1919,13 @@ numst(struct subre *t,
 static void
 markst(struct subre *t)
 {
+    struct subre *t2;
+
     assert(t != NULL);

     t->flags |= INUSE;
-    if (t->left != NULL)
-        markst(t->left);
-    if (t->right != NULL)
-        markst(t->right);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        markst(t2);
 }

 /*
@@ -1949,12 +1955,12 @@ nfatree(struct vars *v,
         struct subre *t,
         FILE *f)                /* for debug output */
 {
+    struct subre *t2;
+
     assert(t != NULL && t->begin != NULL);

-    if (t->left != NULL)
-        (DISCARD) nfatree(v, t->left, f);
-    if (t->right != NULL)
-        (DISCARD) nfatree(v, t->right, f);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        (DISCARD) nfatree(v, t2, f);

     return nfanode(v, t, 0, f);
 }
@@ -2206,6 +2212,7 @@ stdump(struct subre *t,
        int nfapresent)            /* is the original NFA still around? */
 {
     char        idbuf[50];
+    struct subre *t2;

     fprintf(f, "%s. `%c'", stid(t, idbuf, sizeof(idbuf)), t->op);
     if (t->flags & LONGER)
@@ -2231,20 +2238,18 @@ stdump(struct subre *t,
     }
     if (nfapresent)
         fprintf(f, " %ld-%ld", (long) t->begin->no, (long) t->end->no);
-    if (t->left != NULL)
-        fprintf(f, " L:%s", stid(t->left, idbuf, sizeof(idbuf)));
-    if (t->right != NULL)
-        fprintf(f, " R:%s", stid(t->right, idbuf, sizeof(idbuf)));
+    if (t->child != NULL)
+        fprintf(f, " C:%s", stid(t->child, idbuf, sizeof(idbuf)));
+    if (t->sibling != NULL)
+        fprintf(f, " S:%s", stid(t->sibling, idbuf, sizeof(idbuf)));
     if (!NULLCNFA(t->cnfa))
     {
         fprintf(f, "\n");
         dumpcnfa(&t->cnfa, f);
     }
     fprintf(f, "\n");
-    if (t->left != NULL)
-        stdump(t->left, f, nfapresent);
-    if (t->right != NULL)
-        stdump(t->right, f, nfapresent);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        stdump(t2, f, nfapresent);
 }

 /*
diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c
index adcbcc0a8e..4541cf9a7e 100644
--- a/src/backend/regex/regexec.c
+++ b/src/backend/regex/regexec.c
@@ -640,6 +640,8 @@ static void
 zaptreesubs(struct vars *v,
             struct subre *t)
 {
+    struct subre *t2;
+
     if (t->op == '(')
     {
         int            n = t->subno;
@@ -652,10 +654,8 @@ zaptreesubs(struct vars *v,
         }
     }

-    if (t->left != NULL)
-        zaptreesubs(v, t->left);
-    if (t->right != NULL)
-        zaptreesubs(v, t->right);
+    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
+        zaptreesubs(v, t2);
 }

 /*
@@ -714,35 +714,35 @@ cdissect(struct vars *v,
     switch (t->op)
     {
         case '=':                /* terminal node */
-            assert(t->left == NULL && t->right == NULL);
+            assert(t->child == NULL);
             er = REG_OKAY;        /* no action, parent did the work */
             break;
         case 'b':                /* back reference */
-            assert(t->left == NULL && t->right == NULL);
+            assert(t->child == NULL);
             er = cbrdissect(v, t, begin, end);
             break;
         case '.':                /* concatenation */
-            assert(t->left != NULL && t->right != NULL);
-            if (t->left->flags & SHORTER)    /* reverse scan */
+            assert(t->child != NULL);
+            if (t->child->flags & SHORTER)    /* reverse scan */
                 er = crevcondissect(v, t, begin, end);
             else
                 er = ccondissect(v, t, begin, end);
             break;
         case '|':                /* alternation */
-            assert(t->left != NULL);
+            assert(t->child != NULL);
             er = caltdissect(v, t, begin, end);
             break;
         case '*':                /* iteration */
-            assert(t->left != NULL);
-            if (t->left->flags & SHORTER)    /* reverse scan */
+            assert(t->child != NULL);
+            if (t->child->flags & SHORTER)    /* reverse scan */
                 er = creviterdissect(v, t, begin, end);
             else
                 er = citerdissect(v, t, begin, end);
             break;
         case '(':                /* capturing */
-            assert(t->left != NULL && t->right == NULL);
+            assert(t->child != NULL);
             assert(t->subno > 0);
-            er = cdissect(v, t->left, begin, end);
+            er = cdissect(v, t->child, begin, end);
             if (er == REG_OKAY)
                 subset(v, t, begin, end);
             break;
@@ -770,19 +770,22 @@ ccondissect(struct vars *v,
             chr *begin,            /* beginning of relevant substring */
             chr *end)            /* end of same */
 {
+    struct subre *left = t->child;
+    struct subre *right = left->sibling;
     struct dfa *d;
     struct dfa *d2;
     chr           *mid;
     int            er;

     assert(t->op == '.');
-    assert(t->left != NULL && t->left->cnfa.nstates > 0);
-    assert(t->right != NULL && t->right->cnfa.nstates > 0);
-    assert(!(t->left->flags & SHORTER));
+    assert(left != NULL && left->cnfa.nstates > 0);
+    assert(right != NULL && right->cnfa.nstates > 0);
+    assert(right->sibling == NULL);
+    assert(!(left->flags & SHORTER));

-    d = getsubdfa(v, t->left);
+    d = getsubdfa(v, left);
     NOERR();
-    d2 = getsubdfa(v, t->right);
+    d2 = getsubdfa(v, right);
     NOERR();
     MDEBUG(("%d: ccondissect %ld-%ld\n", t->id, LOFF(begin), LOFF(end)));

@@ -799,10 +802,10 @@ ccondissect(struct vars *v,
         /* try this midpoint on for size */
         if (longest(v, d2, mid, end, (int *) NULL) == end)
         {
-            er = cdissect(v, t->left, begin, mid);
+            er = cdissect(v, left, begin, mid);
             if (er == REG_OKAY)
             {
-                er = cdissect(v, t->right, mid, end);
+                er = cdissect(v, right, mid, end);
                 if (er == REG_OKAY)
                 {
                     /* satisfaction */
@@ -831,8 +834,8 @@ ccondissect(struct vars *v,
             return REG_NOMATCH;
         }
         MDEBUG(("%d: new midpoint %ld\n", t->id, LOFF(mid)));
-        zaptreesubs(v, t->left);
-        zaptreesubs(v, t->right);
+        zaptreesubs(v, left);
+        zaptreesubs(v, right);
     }

     /* can't get here */
@@ -848,19 +851,22 @@ crevcondissect(struct vars *v,
                chr *begin,        /* beginning of relevant substring */
                chr *end)        /* end of same */
 {
+    struct subre *left = t->child;
+    struct subre *right = left->sibling;
     struct dfa *d;
     struct dfa *d2;
     chr           *mid;
     int            er;

     assert(t->op == '.');
-    assert(t->left != NULL && t->left->cnfa.nstates > 0);
-    assert(t->right != NULL && t->right->cnfa.nstates > 0);
-    assert(t->left->flags & SHORTER);
+    assert(left != NULL && left->cnfa.nstates > 0);
+    assert(right != NULL && right->cnfa.nstates > 0);
+    assert(right->sibling == NULL);
+    assert(left->flags & SHORTER);

-    d = getsubdfa(v, t->left);
+    d = getsubdfa(v, left);
     NOERR();
-    d2 = getsubdfa(v, t->right);
+    d2 = getsubdfa(v, right);
     NOERR();
     MDEBUG(("%d: crevcondissect %ld-%ld\n", t->id, LOFF(begin), LOFF(end)));

@@ -877,10 +883,10 @@ crevcondissect(struct vars *v,
         /* try this midpoint on for size */
         if (longest(v, d2, mid, end, (int *) NULL) == end)
         {
-            er = cdissect(v, t->left, begin, mid);
+            er = cdissect(v, left, begin, mid);
             if (er == REG_OKAY)
             {
-                er = cdissect(v, t->right, mid, end);
+                er = cdissect(v, right, mid, end);
                 if (er == REG_OKAY)
                 {
                     /* satisfaction */
@@ -909,8 +915,8 @@ crevcondissect(struct vars *v,
             return REG_NOMATCH;
         }
         MDEBUG(("%d: new midpoint %ld\n", t->id, LOFF(mid)));
-        zaptreesubs(v, t->left);
-        zaptreesubs(v, t->right);
+        zaptreesubs(v, left);
+        zaptreesubs(v, right);
     }

     /* can't get here */
@@ -1011,26 +1017,30 @@ caltdissect(struct vars *v,
     struct dfa *d;
     int            er;

-    /* We loop, rather than tail-recurse, to handle a chain of alternatives */
+    assert(t->op == '|');
+
+    t = t->child;
+    /* there should be at least 2 alternatives */
+    assert(t != NULL && t->sibling != NULL);
+
     while (t != NULL)
     {
-        assert(t->op == '|');
-        assert(t->left != NULL && t->left->cnfa.nstates > 0);
+        assert(t->cnfa.nstates > 0);

         MDEBUG(("%d: caltdissect %ld-%ld\n", t->id, LOFF(begin), LOFF(end)));

-        d = getsubdfa(v, t->left);
+        d = getsubdfa(v, t);
         NOERR();
         if (longest(v, d, begin, end, (int *) NULL) == end)
         {
             MDEBUG(("%d: caltdissect matched\n", t->id));
-            er = cdissect(v, t->left, begin, end);
+            er = cdissect(v, t, begin, end);
             if (er != REG_NOMATCH)
                 return er;
         }
         NOERR();

-        t = t->right;
+        t = t->sibling;
     }

     return REG_NOMATCH;
@@ -1056,8 +1066,8 @@ citerdissect(struct vars *v,
     int            er;

     assert(t->op == '*');
-    assert(t->left != NULL && t->left->cnfa.nstates > 0);
-    assert(!(t->left->flags & SHORTER));
+    assert(t->child != NULL && t->child->cnfa.nstates > 0);
+    assert(!(t->child->flags & SHORTER));
     assert(begin <= end);

     MDEBUG(("%d: citerdissect %ld-%ld\n", t->id, LOFF(begin), LOFF(end)));
@@ -1094,7 +1104,7 @@ citerdissect(struct vars *v,
         return REG_ESPACE;
     endpts[0] = begin;

-    d = getsubdfa(v, t->left);
+    d = getsubdfa(v, t->child);
     if (ISERR())
     {
         FREE(endpts);
@@ -1172,8 +1182,8 @@ citerdissect(struct vars *v,

         for (i = nverified + 1; i <= k; i++)
         {
-            zaptreesubs(v, t->left);
-            er = cdissect(v, t->left, endpts[i - 1], endpts[i]);
+            zaptreesubs(v, t->child);
+            er = cdissect(v, t->child, endpts[i - 1], endpts[i]);
             if (er == REG_OKAY)
             {
                 nverified = i;
@@ -1258,8 +1268,8 @@ creviterdissect(struct vars *v,
     int            er;

     assert(t->op == '*');
-    assert(t->left != NULL && t->left->cnfa.nstates > 0);
-    assert(t->left->flags & SHORTER);
+    assert(t->child != NULL && t->child->cnfa.nstates > 0);
+    assert(t->child->flags & SHORTER);
     assert(begin <= end);

     MDEBUG(("%d: creviterdissect %ld-%ld\n", t->id, LOFF(begin), LOFF(end)));
@@ -1299,7 +1309,7 @@ creviterdissect(struct vars *v,
         return REG_ESPACE;
     endpts[0] = begin;

-    d = getsubdfa(v, t->left);
+    d = getsubdfa(v, t->child);
     if (ISERR())
     {
         FREE(endpts);
@@ -1383,8 +1393,8 @@ creviterdissect(struct vars *v,

         for (i = nverified + 1; i <= k; i++)
         {
-            zaptreesubs(v, t->left);
-            er = cdissect(v, t->left, endpts[i - 1], endpts[i]);
+            zaptreesubs(v, t->child);
+            er = cdissect(v, t->child, endpts[i - 1], endpts[i]);
             if (er == REG_OKAY)
             {
                 nverified = i;
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 82e761bfe5..90ee16957a 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -423,15 +423,17 @@ struct cnfa
  *        '='  plain regex without interesting substructure (implemented as DFA)
  *        'b'  back-reference (has no substructure either)
  *        '('  capture node: captures the match of its single child
- *        '.'  concatenation: matches a match for left, then a match for right
- *        '|'  alternation: matches a match for left or a match for right
+ *        '.'  concatenation: matches a match for first child, then second child
+ *        '|'  alternation: matches a match for any of its children
  *        '*'  iteration: matches some number of matches of its single child
  *
- * Note: the right child of an alternation must be another alternation or
- * NULL; hence, an N-way branch requires N alternation nodes, not N-1 as you
- * might expect.  This could stand to be changed.  Actually I'd rather see
- * a single alternation node with N children, but that will take revising
- * the representation of struct subre.
+ * An alternation node can have any number of children (but at least two),
+ * linked through their sibling fields.
+ *
+ * A concatenation node must have exactly two children.  It might be useful
+ * to support more, but that would complicate the executor.  Note that it is
+ * the first child's greediness that determines the node's preference for
+ * where to split a match.
  *
  * Note: when a backref is directly quantified, we stick the min/max counts
  * into the backref rather than plastering an iteration node on top.  This is
@@ -460,8 +462,8 @@ struct subre
                                  * LATYPE code for lookaround constraint */
     short        min;            /* min repetitions for iteration or backref */
     short        max;            /* max repetitions for iteration or backref */
-    struct subre *left;            /* left child, if any (also freelist chain) */
-    struct subre *right;        /* right child, if any */
+    struct subre *child;        /* first child, if any (also freelist chain) */
+    struct subre *sibling;        /* next child of same parent, if any */
     struct state *begin;        /* outarcs from here... */
     struct state *end;            /* ...ending in inarcs here */
     struct cnfa cnfa;            /* compacted NFA, if any */
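
For readers following the left/right to child/sibling conversion above, here is a minimal standalone sketch of a first-child/next-sibling tree. It is not part of the patch: struct node, newnode(), addchild(), number() and freetree() are hypothetical stand-ins, and only the linkage fields of struct subre are modeled. The traversal loop has the same shape as the one the patch uses in numst(), and the free routine fetches the sibling link before freeing a node, as freesubreandsiblings() does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct subre: linkage fields only. */
struct node
{
    char        op;             /* node type, e.g. '|' or '=' */
    int         id;             /* preorder number, assigned by number() */
    struct node *child;         /* first child, if any */
    struct node *sibling;       /* next child of same parent, if any */
};

/* Allocate a zeroed node of the given type. */
static struct node *
newnode(char op)
{
    struct node *n = calloc(1, sizeof(struct node));

    if (n == NULL)
        abort();
    n->op = op;
    return n;
}

/* Append "child" at the end of parent's child list. */
static void
addchild(struct node *parent, struct node *child)
{
    struct node **link = &parent->child;

    while (*link != NULL)
        link = &(*link)->sibling;
    *link = child;
}

/* Number the subtree in preorder; returns the next unused number. */
static int
number(struct node *t, int start)
{
    int         i = start;
    struct node *t2;

    t->id = i++;
    for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
        i = number(t2, i);
    return i;
}

/* Free t and its descendants (not its siblings); returns nodes freed. */
static int
freetree(struct node *t)
{
    int         n = 1;
    struct node *c = t->child;

    while (c != NULL)
    {
        struct node *next = c->sibling; /* fetch before c is freed */

        n += freetree(c);
        c = next;
    }
    free(t);
    return n;
}
```

With this layout an N-way alternation is one parent node plus N children, so the tree walkers need a single loop over the sibling chain rather than separate left and right recursion steps.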
diff --git a/src/backend/regex/README b/src/backend/regex/README
index cafeb3dffb..e4b083664f 100644
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -130,8 +130,8 @@ possibilities.

 As an example, consider the regex "(a[bc]+)\1".  The compiled
 representation will have a top-level concatenation subre node.  Its first
-child is a capture node, and the child of that is a plain DFA node for
-"a[bc]+".  The concatenation's second child is a backref node for \1.
+child is a plain DFA node for "a[bc]+" (which is marked as being a capture
+node).  The concatenation's second child is a backref node for \1.
 The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
 where the backref has been replaced by a copy of the DFA for its referent
 expression.  When executed, the concatenation node will have to search for
@@ -147,6 +147,17 @@ run much faster than a pure NFA engine could do.  It is this behavior that
 justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
 library.

+It's perhaps worth noting that separate capture subre nodes are a rarity:
+normally, we just mark a subre as capturing and that's it.  However, it's
+legal to write a regex like "((x))" in which the same substring has to be
+captured by multiple sets of parentheses.  Since a subre has room for only
+one "capno" field, a single subre can't handle that.  We handle such cases
+by wrapping the base subre (which captures the innermost parens) in a
+no-op capture node, or even more than one for "(((x)))" etc.  This is a
+little bit inefficient because we end up with multiple identical NFAs,
+but since the case is pointless and infrequent, it's not worth working
+harder.
+

 Colors and colormapping
 -----------------------
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 4d483e7e53..891ad15b23 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -452,7 +452,7 @@ pg_regcomp(regex_t *re,
 #endif

         /* Prepend .* to pattern if it's a lookbehind LACON */
-        nfanode(v, lasub, !LATYPE_IS_AHEAD(lasub->subno), debug);
+        nfanode(v, lasub, !LATYPE_IS_AHEAD(lasub->latype), debug);
     }
     CNOERR();
     if (v->tree->flags & SHORTER)
@@ -944,7 +944,13 @@ parseqatom(struct vars *v,
             else
                 atomtype = PLAIN;    /* something that's not '(' */
             NEXT();
-            /* need new endpoints because tree will contain pointers */
+
+            /*
+             * Make separate endpoints to ensure we keep this sub-NFA cleanly
+             * separate from what surrounds it.  We need to be sure that when
+             * we duplicate the sub-NFA for a backref, we get the right states
+             * and no others.
+             */
             s = newstate(v->nfa);
             s2 = newstate(v->nfa);
             NOERR();
@@ -959,11 +965,21 @@ parseqatom(struct vars *v,
             {
                 assert(v->subs[subno] == NULL);
                 v->subs[subno] = atom;
-                t = subre(v, '(', atom->flags | CAP, lp, rp);
-                NOERR();
-                t->subno = subno;
-                t->child = atom;
-                atom = t;
+                if (atom->capno == 0)
+                {
+                    /* normal case: just mark the atom as capturing */
+                    atom->flags |= CAP;
+                    atom->capno = subno;
+                }
+                else
+                {
+                    /* generate no-op wrapper node to handle "((x))" */
+                    t = subre(v, '(', atom->flags | CAP, lp, rp);
+                    NOERR();
+                    t->capno = subno;
+                    t->child = atom;
+                    atom = t;
+                }
             }
             /* postpone everything else pending possible {0} */
             break;
@@ -976,7 +992,7 @@ parseqatom(struct vars *v,
             atom = subre(v, 'b', BACKR, lp, rp);
             NOERR();
             subno = v->nextvalue;
-            atom->subno = subno;
+            atom->backno = subno;
             EMPTYARC(lp, rp);    /* temporarily, so there's something */
             NEXT();
             break;
@@ -1276,8 +1292,10 @@ parseqatom(struct vars *v,
             freesubre(v, top->child);
             top->op = t->op;
             top->flags = t->flags;
+            top->latype = t->latype;
             top->id = t->id;
-            top->subno = t->subno;
+            top->capno = t->capno;
+            top->backno = t->backno;
             top->min = t->min;
             top->max = t->max;
             top->child = t->child;
@@ -1790,8 +1808,10 @@ subre(struct vars *v,

     ret->op = op;
     ret->flags = flags;
+    ret->latype = (char) -1;
     ret->id = 0;                /* will be assigned later */
-    ret->subno = 0;
+    ret->capno = 0;
+    ret->backno = 0;
     ret->min = ret->max = 1;
     ret->child = NULL;
     ret->sibling = NULL;
@@ -1893,7 +1913,7 @@ numst(struct subre *t,
     assert(t != NULL);

     i = start;
-    t->id = (short) i++;
+    t->id = i++;
     for (t2 = t->child; t2 != NULL; t2 = t2->sibling)
         i = numst(t2, i);
     return i;
@@ -2040,7 +2060,7 @@ newlacon(struct vars *v,
     sub = &v->lacons[n];
     sub->begin = begin;
     sub->end = end;
-    sub->subno = latype;
+    sub->latype = latype;
     ZAPCNFA(sub->cnfa);
     return n;
 }
@@ -2163,7 +2183,7 @@ dump(regex_t *re,
         struct subre *lasub = &g->lacons[i];
         const char *latype;

-        switch (lasub->subno)
+        switch (lasub->latype)
         {
             case LATYPE_AHEAD_POS:
                 latype = "positive lookahead";
@@ -2227,8 +2247,12 @@ stdump(struct subre *t,
         fprintf(f, " hasbackref");
     if (!(t->flags & INUSE))
         fprintf(f, " UNUSED");
-    if (t->subno != 0)
-        fprintf(f, " (#%d)", t->subno);
+    if (t->latype != (char) -1)
+        fprintf(f, " latype(%d)", t->latype);
+    if (t->capno != 0)
+        fprintf(f, " capture(%d)", t->capno);
+    if (t->backno != 0)
+        fprintf(f, " backref(%d)", t->backno);
     if (t->min != 1 || t->max != 1)
     {
         fprintf(f, " {%d,", t->min);
diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c
index 89d162ed6a..957ceb8137 100644
--- a/src/backend/regex/rege_dfa.c
+++ b/src/backend/regex/rege_dfa.c
@@ -825,12 +825,12 @@ lacon(struct vars *v,
     d = getladfa(v, n);
     if (d == NULL)
         return 0;
-    if (LATYPE_IS_AHEAD(sub->subno))
+    if (LATYPE_IS_AHEAD(sub->latype))
     {
         /* used to use longest() here, but shortest() could be much cheaper */
         end = shortest(v, d, cp, cp, v->stop,
                        (chr **) NULL, (int *) NULL);
-        satisfied = LATYPE_IS_POS(sub->subno) ? (end != NULL) : (end == NULL);
+        satisfied = LATYPE_IS_POS(sub->latype) ? (end != NULL) : (end == NULL);
     }
     else
     {
@@ -843,7 +843,7 @@ lacon(struct vars *v,
          * nominal match.
          */
         satisfied = matchuntil(v, d, cp, &v->lblastcss[n], &v->lblastcp[n]);
-        if (!LATYPE_IS_POS(sub->subno))
+        if (!LATYPE_IS_POS(sub->latype))
             satisfied = !satisfied;
     }
     FDEBUG(("=== lacon %d satisfied %d\n", n, satisfied));
diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c
index 4541cf9a7e..2a1ef0176a 100644
--- a/src/backend/regex/regexec.c
+++ b/src/backend/regex/regexec.c
@@ -640,13 +640,11 @@ static void
 zaptreesubs(struct vars *v,
             struct subre *t)
 {
+    int            n = t->capno;
     struct subre *t2;

-    if (t->op == '(')
+    if (n > 0)
     {
-        int            n = t->subno;
-
-        assert(n > 0);
         if ((size_t) n < v->nmatch)
         {
             v->pmatch[n].rm_so = -1;
@@ -667,7 +665,7 @@ subset(struct vars *v,
        chr *begin,
        chr *end)
 {
-    int            n = sub->subno;
+    int            n = sub->capno;

     assert(n > 0);
     if ((size_t) n >= v->nmatch)
@@ -739,12 +737,10 @@ cdissect(struct vars *v,
             else
                 er = citerdissect(v, t, begin, end);
             break;
-        case '(':                /* capturing */
+        case '(':                /* no-op capture node */
             assert(t->child != NULL);
-            assert(t->subno > 0);
+            assert(t->capno > 0);
             er = cdissect(v, t->child, begin, end);
-            if (er == REG_OKAY)
-                subset(v, t, begin, end);
             break;
         default:
             er = REG_ASSERT;
@@ -758,6 +754,12 @@ cdissect(struct vars *v,
      */
     assert(er != REG_NOMATCH || (t->flags & BACKR));

+    /*
+     * If this node is marked as capturing, save successful match's location.
+     */
+    if (t->capno > 0 && er == REG_OKAY)
+        subset(v, t, begin, end);
+
     return er;
 }

@@ -932,7 +934,7 @@ cbrdissect(struct vars *v,
            chr *begin,            /* beginning of relevant substring */
            chr *end)            /* end of same */
 {
-    int            n = t->subno;
+    int            n = t->backno;
     size_t        numreps;
     size_t        tlen;
     size_t        brlen;
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index 90ee16957a..306525eb5f 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -422,7 +422,7 @@ struct cnfa
  * "op" is one of:
  *        '='  plain regex without interesting substructure (implemented as DFA)
  *        'b'  back-reference (has no substructure either)
- *        '('  capture node: captures the match of its single child
+ *        '('  no-op capture node: captures the match of its single child
  *        '.'  concatenation: matches a match for first child, then second child
  *        '|'  alternation: matches a match for any of its children
  *        '*'  iteration: matches some number of matches of its single child
@@ -446,8 +446,8 @@ struct subre
 #define  LONGER  01                /* prefers longer match */
 #define  SHORTER 02                /* prefers shorter match */
 #define  MIXED     04                /* mixed preference below */
-#define  CAP     010            /* capturing parens below */
-#define  BACKR     020            /* back reference below */
+#define  CAP     010            /* capturing parens here or below */
+#define  BACKR     020            /* back reference here or below */
 #define  INUSE     0100            /* in use in final tree */
 #define  NOPROP  03                /* bits which may not propagate up */
 #define  LMIX(f) ((f)<<2)        /* LONGER -> MIXED */
@@ -457,9 +457,10 @@ struct subre
 #define  PREF(f) ((f)&NOPROP)
 #define  PREF2(f1, f2)     ((PREF(f1) != 0) ? PREF(f1) : PREF(f2))
 #define  COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2))
-    short        id;                /* ID of subre (1..ntree-1) */
-    int            subno;            /* subexpression number for 'b' and '(', or
-                                 * LATYPE code for lookaround constraint */
+    char        latype;            /* LATYPE code, if lookaround constraint */
+    int            id;                /* ID of subre (1..ntree-1) */
+    int            capno;            /* if capture node, subno to capture into */
+    int            backno;            /* if backref node, subno it refers to */
     short        min;            /* min repetitions for iteration or backref */
     short        max;            /* max repetitions for iteration or backref */
     struct subre *child;        /* first child, if any (also freelist chain) */

Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

February 18, 2021, 10:30:09

On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote:

> Attached is an updated patch series; it's rebased over 4e703d671
> which took care of some not-really-related fixes, and I made a
> pass of cleanup and comment improvements. I think this is pretty
> much ready to commit, unless you want to do more testing or
> code-reading.

I've produced a new dataset which now also includes the regex flags (if any) used for each subject applied to a pattern.

The new dataset contains 318364 patterns and 4474520 subjects.

(The old one had 235204 patterns and 1489489 subjects.)

I've tested the new dataset against PostgreSQL 10.16, 11.11, 12.6, 13.2, HEAD (4e703d671) and HEAD+patches.

I based the comparisons on the subjects that didn't cause an error on 13.2:

CREATE TABLE performance_test AS
SELECT
    subjects.subject,
    patterns.pattern,
    patterns.flags,
    tests.is_match,
    tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
;

I then measured the query below for each PostgreSQL version:

\timing

SELECT version();

SELECT
    is_match <> (subject ~ pattern) AS is_match_diff,
    captured IS DISTINCT FROM regexp_match(subject, pattern, flags) AS captured_diff,
    COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;

All versions produce the same result:

 is_match_diff | captured_diff |  count
---------------+---------------+---------
 f             | f             | 3254769
(1 row)

Good! Not a single case differs among more than 3 million regex pattern/subject combinations,

across five major PostgreSQL versions! That's a very stable regex engine.

To get a feeling for the standard deviation of the timings,

I executed the same query above three times for each PostgreSQL version:

PostgreSQL 10.16 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit
Time: 795674.830 ms (13:15.675)
Time: 794249.704 ms (13:14.250)
Time: 771036.707 ms (12:51.037)

PostgreSQL 11.11 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 765466.191 ms (12:45.466)
Time: 787135.316 ms (13:07.135)
Time: 779582.635 ms (12:59.583)

PostgreSQL 12.6 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 785500.516 ms (13:05.501)
Time: 784511.591 ms (13:04.512)
Time: 786727.973 ms (13:06.728)

PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)

PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)

HEAD (4e703d671)+0001+0002+0003+0004+0005
Time: 141290.329 ms (02:21.290)
Time: 141849.709 ms (02:21.850)
Time: 141630.819 ms (02:21.631)

That's a mind-blowing speed-up!
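
To put rough numbers on that, the three runs per build can be averaged and compared. This is only a back-of-the-envelope sketch over the timings quoted above; the variable names are illustrative and nothing here accounts for differences in how the binaries were built.

```javascript
// Timings (ms) copied from the three runs reported above.
const runs = {
  "13.2": [758514.703, 755883.600, 746522.107],
  "HEAD": [519620.646, 518998.366, 519696.129],
  "HEAD+patches": [141290.329, 141849.709, 141630.819],
};

// Arithmetic mean of a list of numbers.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

const means = Object.fromEntries(
  Object.entries(runs).map(([name, xs]) => [name, mean(xs)])
);

// Speed-up factor of the patched build relative to each baseline.
const speedupVsHead = means["HEAD"] / means["HEAD+patches"];
const speedupVs13 = means["13.2"] / means["HEAD+patches"];

console.log(speedupVsHead.toFixed(2), speedupVs13.toFixed(2));
```

On these figures the patched build comes out roughly 3.7 times faster than unpatched HEAD and roughly 5.3 times faster than 13.2.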

I also ran the more detailed test between 13.2 and HEAD+patches,

that also tests for differences in errors.

Like before, one similar improvement was found,

which previously resulted in an error, but now goes through OK:

SELECT * FROM vdeviations;

-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
flags            |
subject          | www.aeroexpo.online
count            | 1
a_server_version | 13.2
a_duration       | 00:00:00.298253
a_is_match       |
a_captured       |
a_error          | invalid regular expression: regular expression is too complex
b_server_version | 14devel
b_duration       | 00:00:00.665958
b_is_match       | t
b_captured       | {online}
b_error          |

Very nice.

I've uploaded the new dataset to the same place as before.

The schema for it can be found at https://github.com/truthly/regexes-in-the-wild

If anyone else would like a copy of the 715MB dataset, please let me know.

/Joel

Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

February 18, 2021, 11:04:55

On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote:

>SELECT * FROM vdeviations;

>-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------

Heh, what a funny coincidence:

The regex I used to shrink the very-long-pattern,

actually happens to run a lot faster with the patches.

I noticed it when trying to read from the vdeviations view in PostgreSQL 13.2.

Here is my little helper-function which I used to shrink patterns/subjects longer than N characters:

CREATE OR REPLACE FUNCTION shrink_text(text,integer) RETURNS text LANGUAGE sql AS $$
SELECT CASE WHEN length($1) < $2 THEN $1 ELSE
    format('%s ... %s chars ... %s', m[1], length(m[2]), m[3])
END
FROM (
    SELECT regexp_matches($1,format('^(.{1,%1$s})(.*?)(.{1,%1$s})$',$2/2)) AS m
) AS q
$$;

The regex aims to produce three capture groups: the first and third are greedy

and each match up to $2/2 characters (controlled by the second input param to the function),

while the second capture group in the middle is non-greedy,

but matches the remainder to make up a fully anchored match.

It works as expected in both 13.2 and HEAD+patches, but the speed-up is enormous:

PostgreSQL 13.2:

EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..0.02 rows=1 width=32) (actual time=23600.816..23600.838 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
 Planning Time: 0.432 ms
 Execution Time: 23600.859 ms
(4 rows)

HEAD+0001+0002+0003+0004+0005:

EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..0.02 rows=1 width=32) (actual time=36.656..36.661 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.002 rows=1 loops=1)
 Planning Time: 0.575 ms
 Execution Time: 36.689 ms
(4 rows)

Cool stuff.

/Joel

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

February 18, 2021, 18:10:47

"Joel Jacobson" <joel@compiler.org> writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.

Again, thanks for collecting this data!  I'm a little confused about
how you produced the results in the "tests" table, though.  It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well.  Even discounting
that, I'm not getting quite the same results, and I don't understand
why not.  So how was that made from the raw "patterns" and "subjects"
tables?

> PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
> Time: 758514.703 ms (12:38.515)
> Time: 755883.600 ms (12:35.884)
> Time: 746522.107 ms (12:26.522)
>
> PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
> HEAD (4e703d671)
> Time: 519620.646 ms (08:39.621)
> Time: 518998.366 ms (08:38.998)
> Time: 519696.129 ms (08:39.696)

Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD.  I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan).  Still, the next drop to circa 2:21 runtime is impressive enough
by itself.

> Heh, what a funny coincidence:
> The regex I used to shrink the very-long-pattern,
> actually happens to run a lot faster with the patches.

Yeah, that just happens to be a poster child for the MATCHALL idea:

> EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');

Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one.  That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.
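
As a rough illustration of why length ranges alone are enough, here is a simplified model, not the engine's actual code. The helper `splitByLength` and the `{min, max}` objects are hypothetical; the sketch assumes each piece matches any text whose length falls in its range, prefers the longest first piece and then the shortest middle piece, and never looks at the subject's characters.

```javascript
// Hypothetical helper: each piece is a match-anything sub-pattern, described
// only by the range of lengths it accepts. The subject text is never scanned.
function splitByLength(n, first, middle, last) {
  // Prefer the longest first piece (greedy), then the shortest middle piece.
  for (let a = Math.min(first.max, n); a >= first.min; a--) {
    const rest = n - a;
    // Smallest middle length that still leaves the last piece within its maximum.
    const b = Math.max(middle.min, rest - last.max);
    const c = rest - b;
    if (b <= middle.max && c >= last.min && c <= last.max) {
      return [a, b, c];
    }
  }
  return null; // no division satisfies all three length ranges
}

const piece = { min: 1, max: 80 };       // models (.{1,80})
const any = { min: 0, max: Infinity };   // models (.*?)

console.log(splitByLength(100000, piece, any, piece)); // [ 80, 99840, 80 ]
console.log(splitByLength(100, piece, any, piece));    // [ 80, 0, 20 ]
```

Each candidate division is accepted or rejected by arithmetic on lengths, which is the property that removes the repeated rescanning of the text.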

            regards, tom lane

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

February 18, 2021, 18:53:05

I thought it was worth looking a little more closely at the error
cases in this set of tests, as a form of competition analysis versus
Javascript's regex engine.  I ran through the cases that gave errors,
and pinned down exactly what was causing the error for as many cases
as I could.  (These results are from your first test corpus, but
I doubt the second one would give different conclusions.)

We have these errors reported in the test corpus:

               error               | count
-----------------------------------+-------
 invalid escape \ sequence         | 39141
 invalid character range           |   898
 invalid backreference number      |   816
 brackets [] not balanced          |   327
 invalid repetition count(s)       |    76
 quantifier operand invalid        |    17
 parentheses () not balanced       |     1
 regular expression is too complex |     1

The existing patchset takes care of the one "regular expression is too
complex" failure.  Of the rest:

It turns out that almost 39000 of the "invalid escape \ sequence"
errors are due to use of \D, \S, or \W within a character class.
We support the positive-class shorthands \d, \s, \w there, but not
their negations.  I think that this might be something that Henry
Spencer just never got around to; I don't see any fundamental reason
we can't allow it, although some refactoring might be needed in the
regex lexer.  Given the apparent popularity of this notation, maybe
we should put some work into that.

(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]".  It seems to me that that's
a very expensive and unwieldy way to spell ".".  Am I missing
something about what that does in Javascript?)

About half of the remaining escape-sequence complaints seem to be due
to just randomly backslashing alphanumeric characters that don't need
it, as for example "i" in "\itunes\.apple\.com".  Apparently
Javascript is content to take "\i" as just meaning "i".  Our engine
rejects that, with a view to keeping such combinations reserved for
future definition.  That's fine by me so I don't want to change it.
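
For reference, the Javascript side of this is easy to check. The snippet below is a small sketch using the pattern quoted above; it shows that without the "u" flag an unrecognized escape is taken as the literal character, while the stricter "u" mode rejects it.

```javascript
// Without the "u" flag, JavaScript treats an unrecognized escape such as \i
// as the literal character itself.
const re = new RegExp("\\itunes\\.apple\\.com");
console.log(re.test("itunes.apple.com")); // true

// With the "u" flag the same pattern is rejected as a syntax error.
let rejected = false;
try {
  new RegExp("\\itunes\\.apple\\.com", "u");
} catch (e) {
  rejected = e instanceof SyntaxError;
}
console.log(rejected); // true
```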

Of the rest, many are abbreviated numeric escapes, eg "\u45" where our
engine wants to see "\u0045".  I don't think being laxer about that
would be a great idea either.

Lastly, there are some occurrences like "[\1]", which in context look
like the \1 might be intended as a back-reference?  But I don't really
understand what that's supposed to do inside a bracket expression.

The "invalid character range" errors seem to be coming from constructs
like "[A-Za-z0-9-/]", which our engine rejects because it looks like
a messed-up character range.
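
A quick sketch of how Javascript reads the same bracket expression: the hyphen after "0-9" is taken as a literal, so "-" and "/" both match, but "." (which sorts between them) does not, as it would if "-/" had been parsed as a range.

```javascript
// Anchored so that exactly one character from the class must match.
const re = new RegExp("^[A-Za-z0-9-/]$");

console.log(re.test("-")); // true: the hyphen is a literal member
console.log(re.test("/")); // true
console.log(re.test(".")); // false: no range spanning "-" to "/" was created
```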

All but 123 of the "invalid backreference number" complaints stem from
using backrefs inside lookahead constraints.  Some of the rest look
like they think you can put capturing parens inside a lookahead
constraint and then backref that.  I'm not really convinced that such
constructs have a well-defined meaning.  (I looked at the ECMAscript
definition of regexes, and they do say it's allowed, but when trying
to define it they resort to handwaving about backtracking; at best that
is a particularly lame version of specification by implementation.)
Spencer chose to forbid these cases in our engine, and I think there
are very good implementation reasons why it won't work.  Perhaps we
could provide a clearer error message about it, though.

307 of the "brackets [] not balanced" errors, as well as the one
"parentheses () not balanced" error, seem to trace to the fact that
Javascript considers "[]" to be a legal empty character class, whereas
POSIX doesn't allow empty character classes so our engine takes the
"]" literally, and then looks for a right bracket it won't find.
(That is, in POSIX "[]x]" is a character class matching ']' and 'x'.)
Maybe I'm misinterpreting this too, because if I read the
documentation correctly, "[]" in Javascript matches nothing, making
it impossible for the regex to succeed.  Why would such a construct
appear this often?
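
The reading of the Javascript semantics can at least be confirmed with a short sketch: "[]" compiles and never matches, and the POSIX-style "[]x]" is parsed as an empty class followed by the literal text "x]", so it cannot match either. This says nothing about why the construct shows up in the corpus.

```javascript
// An empty class is legal in JavaScript and matches no character at all.
const empty = new RegExp("[]");
console.log(empty.test("abc")); // false
console.log(empty.test(""));    // false

// POSIX would read this as a class containing "]" and "x"; JavaScript reads
// it as the empty class followed by the literals "x" and "]".
const posixStyle = new RegExp("[]x]");
console.log(posixStyle.test("x"));  // false
console.log(posixStyle.test("]"));  // false
console.log(posixStyle.test("x]")); // false
```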

The remainder of the bracket errors happen because in POSIX, the
sequences "[:", "[=", and "[." within a bracket expression introduce
special syntax, whereas in Javascript '[' is just an ordinary data
character within a bracket expression.  Not much we can do here; the
standards are just incompatible.

All but 3 of the "invalid repetition count(s)" errors come from
quantifiers larger than our implementation limit of 255.  A lot of
those are exactly 256, though I saw one as high as 3000.  The
remaining 3 errors are from syntax like "[0-9]{0-3}", which is a
syntax error according to our engine ("[0-9]{0,3}" is correct).
AFAICT it's not a valid quantifier according to Javascript either;
perhaps that engine is just taking the "{0-3}" as literal text?
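
That guess is straightforward to check. In the sketch below (no "u" flag), the brace group that does not form a valid quantifier is matched as literal text, so the pattern only matches a digit followed by the characters "{0-3}".

```javascript
// "{0-3}" is not a valid quantifier, so without the "u" flag it is literal text.
const re = new RegExp("^[0-9]{0-3}$");

console.log(re.test("7{0-3}")); // true: digit followed by the literal braces
console.log(re.test("7"));      // false
console.log(re.test("777"));    // false: no repetition was applied
```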

Given this, it seems like there's a fairly strong case for increasing
our repetition-count implementation limit, at least to 256, and maybe
1000 or so.  I'm hesitant to make the limit *really* large, but if
we can handle a regex containing thousands of "x"'s, it's not clear
why you shouldn't be able to write that as "x{0,1000}".

All of the "quantifier operand invalid" errors come from these
three patterns:
    ((?!\\)?\{0(?!\\)?\})
    ((?!\\)?\{1(?!\\)?\})
    class="(?!(tco-hidden|tco-display|tco-ellipsis))+.*?"|data-query-source=".*?"|dir=".*?"|rel=".*?"
which are evidently trying to apply a quantifier to a lookahead
constraint, which is just silly.

In short, a lot of this is from incompatible standards, or maybe
from varying ideas about whether to throw an error for invalid
constructs.  But I see a couple things we could improve.

            regards, tom lane

Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

February 18, 2021, 19:58:07

On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.

> Again, thanks for collecting this data! I'm a little confused about
> how you produced the results in the "tests" table, though. It sort
> of looks like you tried to feed the Javascript flags to regexp_match(),
> which unsurprisingly doesn't work all that well.

That's exactly what I did. Some of the flags work the same between Javascript and PostgreSQL, others don't.

I thought maybe something interesting would surface in just trying them blindly.

Flags that aren't supported and give errors are reported as tests where error is not null.

Most patterns have no flags, and second most popular is just the "i" flag, which should work the same.

SELECT flags, COUNT(*) FROM patterns GROUP BY 1 ORDER BY 2 DESC;

 flags | count
-------+--------
       | 151927
 i     | 120336
 gi    |  26057
 g     |  13263
 gm    |   4606
 gim   |    699
 im    |    491
 y     |    367
 m     |    365
 gy    |    105
 u     |     50
 giy   |     38
 giu   |     20
 gimu  |     14
 iy    |     11
 iu    |      6
 gimy  |      3
 gu    |      2
 gmy   |      2
 imy   |      1
 my    |      1
(21 rows)

This query shows which Javascript regex flags could be used as-is without errors:

SELECT
    patterns.flags,
    COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
GROUP BY 1
ORDER BY 2;

 flags |  count
-------+---------
 im    |    2534
 m     |    4460
 i     |  543598
       | 2704177
(4 rows)

I considered filtering/converting the flags to PostgreSQL,

maybe that would be an interesting approach to try as well.

> Even discounting
> that, I'm not getting quite the same results, and I don't understand
> why not. So how was that made from the raw "patterns" and "subjects"
> tables?

The rows in the tests table were generated by the create_regexp_tests() function [1]

Each subject now has a foreign key to a specific pattern,

where the (pattern, flags) combination is unique in patterns.

The actual unique constraint is on (pattern_hash, flags) to avoid

an index directly on pattern which can be huge as we've seen.

So, for each subject, it is known via the pattern_id

exactly what flags were used when the regex was compiled

(and later executed/applied with the subject).

[1] https://github.com/truthly/regexes-in-the-wild/blob/master/create_regexp_tests.sql

>> PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
>> Time: 758514.703 ms (12:38.515)
>> Time: 755883.600 ms (12:35.884)
>> Time: 746522.107 ms (12:26.522)
>>
>> PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
>> HEAD (4e703d671)
>> Time: 519620.646 ms (08:39.621)
>> Time: 518998.366 ms (08:38.998)
>> Time: 519696.129 ms (08:39.696)
>
> Hmmm ... we haven't yet committed any performance-relevant changes to the
> regex code, so it can't take any credit for this improvement from 13.2 to
> HEAD. I speculate that this is due to some change in our parallelism
> stuff (since I observe that this query is producing a parallelized hash
> plan). Still, the next drop to circa 2:21 runtime is impressive enough
> by itself.

OK. Another factor might perhaps be the PostgreSQL 10, 11, 12, 13 versions were compiled elsewhere,

I used the OS X binaries from https://postgresapp.com/, whereas version 14 I of course compiled myself.

Maybe I should have compiled 10, 11, 12, 13 myself instead, for a better comparison,

but I mostly just wanted to verify if I could find any differences, the performance comparison was a bonus.

>> Heh, what a funny coincidence:
>> The regex I used to shrink the very-long-pattern,
>> actually happens to run a lot faster with the patches.
>
> Yeah, that just happens to be a poster child for the MATCHALL idea:
>
>> EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
>
> Each of the parenthesized subexpressions of the RE is successfully
> recognized as being MATCHALL, with length range 1..80 for two of them and
> 0..infinity for the middle one. That means the engine doesn't have to
> physically scan the text to determine whether a possible division point
> satisfies the sub-regexp; and that means we can find the correct division
> points in O(N) not O(N^2) time.

Very nice.

Like you said earlier, perhaps the regex engine has been optimized enough for this time.

If not, you want to investigate an additional idea,

that I think can be seen as a generalization of the optimization trick for (.*),

if I've understood how it works correctly.

Let's see if I can explain the idea:

One of the problems with representing regexes with large bracket range expressions, like [a-z],

is you get an explosion of edges, if the model can only represent state transitions for single characters.

If we could instead let a single edge (for a state transition) represent a set of characters,

or normally even more efficiently, a set of range of characters, then we could reduce the

number of edges we need to represent the graph.

The naive approach to just use the ranges as-is doesn't work.

Instead, the graph must first be created with single-character edges.

It is then examined what ranges can be constructed in a way that no single range

overlaps any other range, so that every range can be seen as a character in an alphabet.

Perhaps a bit of fiddling with some examples is easiest

to get a grip of the idea.

Here is a live demo of the idea:

https://compiler.org/reason-re-nfa/src/index.html

The graphs are rendered live when typing in the regex,

using a Javascript port of GraphViz.

For example, try entering the regex: t[a-z]*m

This generates this range-optimized graph for the regex:

/--[a-ln-su-z]-----------------\

|/------t--------------------\ |

|| | |

-->(0)--t-->({0,1})----m-------->({0 1 2}) | |

^---[a-ln-su-z]--/ | |

^-------t-------/ | |

^---------------------------/ |

^-----------------------------/

Notice how the [a-z] bracket expression has been split up,

and we now have 3 distinct sets of "ranges":

[a-ln-su-z]

Since no ranges are overlapping, each such range can safely be seen as a letter in an alphabet.
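
The splitting step can be sketched as follows. This is a minimal illustration of the idea rather than the code behind the demo; `disjointRanges` is a hypothetical name, and it takes the character ranges that appear on edges (here "t", "[a-z]" and "m") and cuts them at every boundary so that no two resulting ranges overlap.

```javascript
// Split possibly overlapping inclusive ranges into disjoint ranges, so that
// each result can be treated as a single symbol of the alphabet.
function disjointRanges(ranges) {
  // Every range start, and every position just past a range end, is a cut point.
  const cuts = new Set();
  for (const [lo, hi] of ranges) {
    cuts.add(lo.codePointAt(0));
    cuts.add(hi.codePointAt(0) + 1);
  }
  const points = [...cuts].sort((a, b) => a - b);

  const out = [];
  for (let i = 0; i + 1 < points.length; i++) {
    const lo = points[i];
    const hi = points[i + 1] - 1;
    // Keep only intervals that lie inside at least one input range.
    const covered = ranges.some(
      ([rlo, rhi]) => rlo.codePointAt(0) <= lo && hi <= rhi.codePointAt(0)
    );
    if (covered) {
      const a = String.fromCodePoint(lo);
      const b = String.fromCodePoint(hi);
      out.push(lo === hi ? a : `${a}-${b}`);
    }
  }
  return out;
}

console.log(disjointRanges([["t", "t"], ["a", "z"], ["m", "m"]]));
// [ 'a-l', 'm', 'n-s', 't', 'u-z' ]
```

The three multi-character pieces are the a-l, n-s and u-z ranges from the graph, with "m" and "t" left as single-character symbols.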

Once we have our final graph, but before we proceed to generate the machine code for it,

we can shrink the graph further by merging ranges together, which eliminates some edges:

/--------------\

| |

--->(0)--t-->(1)<--[a-ln-z]--/

|^-[a-lnz]-\

\----m-->((2))<----\

| |

\---m---/

Notice how [a-ln-su-z]+t becomes [a-ln-z].

Another optimization I've come up with (or probably re-invented because it feels quite obvious),

is to read more than one character, when knowing for sure multiple characters-in-a-row

are expected, by concatenating edges having only one parent and one child.

In our example, we know for sure at least two characters will be read for the regex t[a-z]*m,

so with this optimization enabled, we get this graph:

/--[a-ln-z]

| |

--->(0)---t[a-ln-z]--->(1)<---+--[a-ln-z]

| | /

| \---m--->((2))<------\

\--------------tm------------^ | |

\----m----/

This makes not much difference for a few characters,

but if we have a long pattern with a long sentence

that is repeated, we could e.g. read in 32 bytes

and compare them all in one operation,

if our machine had 256-bit SIMD registers/instructions.

This idea has also been implemented in the online demo.

There is a level which can be adjusted

from 0 to 32 to control how many bytes to merge at most,

located in the "[+]dfa5 = merge_linear(dfa4)" step.

Anyway, I can totally understand if you've had enough of regex optimizations for this time,

but in case not, I wanted to share my work in this field, in case it's interesting to look at now or in the future.

/Joel

Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

February 18, 2021, 20:44:07

On Thu, Feb 18, 2021, at 20:58, Joel Jacobson wrote:

>Like you said earlier, perhaps the regex engine has been optimized enough for this time.

>If not, you want to investigate an additional idea,

In the above sentence, I meant "you _may_ want to".

I'm not at all sure these ideas are applicable to the PostgreSQL regex engine,

so feel free to silently ignore them if you feel there is a risk of wasting time.

>that I think can be seen as a generalization of the optimization trick for (.*),

>if I've understood how it works correctly.

Actually not sure if it can be seen as a generalization,

I just came to think of my ideas since they also improve

the case when you have lots of (.*) or bracket expressions of large ranges.

/Joel

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

February 18, 2021, 20:44:47

"Joel Jacobson" <joel@compiler.org> writes:
> Let's see if I can explain the idea:
> One of the problems with representing regexes with large bracket range expressions, like [a-z],
> is you get an explosion of edges, if the model can only represent state transitions for single characters.
> If we could instead let a single edge (for a state transition) represent a set of characters,
> or normally even more efficiently, a set of range of characters, then we could reduce the
> number of edges we need to represent the graph.
> The naive approach to just use the ranges as-is doesn't work.
> Instead, the graph must first be created with single-character edges.
> It is then examined what ranges can be constructed in a way that no single range
> overlaps any other range, so that every range can be seen as a character in an alphabet.

Hmm ... I might be misunderstanding, but I think our engine already
does a version of this.  See the discussion of "colors" in
src/backend/regex/README.

> Another optimization I've come up with (or probably re-invented because it feels quite obvious),
> is to read more than one character, when knowing for sure multiple characters-in-a-row
> are expected, by concatenating edges having only one parent and one child.

Maybe.  In practice the actual scanning tends to be tracking more than one
possible NFA state in parallel, so I'm not sure how often we could expect
to be able to use this idea.  That is, even if we know that state X can
only succeed by following an arc to Y and then another to Z, we might
also be interested in what happens if the NFA is in state Q at this point;
and it seems unlikely that Q would have exactly the same two following
arc colors.
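
A toy simulation may make the point about parallel states concrete. This is only a sketch of set-of-states scanning for the example t[a-z]*m, with a hypothetical arc table; it is not how the engine's data structures look.

```javascript
// Toy NFA for t[a-z]*m: state 0 is the start, state 2 is accepting.
const arcs = [
  { from: 0, to: 1, accepts: (c) => c === "t" },
  { from: 1, to: 1, accepts: (c) => c >= "a" && c <= "z" },
  { from: 1, to: 2, accepts: (c) => c === "m" },
];

// Advance every currently active state by one input character.
function step(states, c) {
  const next = new Set();
  for (const arc of arcs) {
    if (states.has(arc.from) && arc.accepts(c)) next.add(arc.to);
  }
  return next;
}

// Record the set of active states after each character of the input.
function trace(input) {
  let states = new Set([0]);
  const history = [];
  for (const c of input) {
    states = step(states, c);
    history.push([...states].sort((a, b) => a - b));
  }
  return history;
}

console.log(trace("tmam")); // [ [ 1 ], [ 1, 2 ], [ 1 ], [ 1, 2 ] ]
```

After each "m" two states are live at once, so a shortcut that reads several characters ahead on behalf of one state would still have to be reconciled with whatever the other live states expect.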

I do have some ideas about possible future optimizations, and one reason
I'm grateful for this large set of real regexes is that it can provide a
concrete basis for deciding that particular optimizations are or are not
worth pursuing.  So thanks again for collecting it!

            regards, tom lane

Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

February 18, 2021, 20:54:38

On Thu, Feb 18, 2021, at 21:44, Tom Lane wrote:

>Hmm ... I might be misunderstanding, but I think our engine already

>does a version of this. See the discussion of "colors" in

>src/backend/regex/README.

Thanks, I will read it with great interest.

>Maybe. In practice the actual scanning tends to be tracking more than one

>possible NFA state in parallel, so I'm not sure how often we could expect

>to be able to use this idea. That is, even if we know that state X can

>only succeed by following an arc to Y and then another to Z, we might

>also be interested in what happens if the NFA is in state Q at this point;

>and it seems unlikely that Q would have exactly the same two following

>arc colors.

Right. Actually I don't have a clear idea on how it could be implemented in an NFA engine.

>I do have some ideas about possible future optimizations, and one reason

>I'm grateful for this large set of real regexes is that it can provide a

>concrete basis for deciding that particular optimizations are or are not

>worth pursuing. So thanks again for collecting it!

My pleasure. Thanks for using it!

/Joel

Re: Some regular-expression performance hacking

From

"Joel Jacobson"

Date:

February 19, 2021, 12:45:34

On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:

>(Having said that, I can't help noticing that a very large fraction

>of those usages look like, eg, "[\w\W]". It seems to me that that's

>a very expensive and unwieldy way to spell ".". Am I missing

>something about what that does in Javascript?)

This popular regex

^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$

is coming from jQuery:

// A simple way to check for HTML strings

// Prioritize #id over <tag> to avoid XSS via location.hash (#9521)

// Strict HTML recognition (#11290: must start with <)

// Shortcut simple #id case for speed

rquickExpr = /^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$/,

From: https://code.jquery.com/jquery-3.5.1.js

I think this is a non-POSIX hack to match any character, including newlines,

which are not included unless the "s" flag is set.

Javascript test:

"foo\nbar".match(/(.+)/)[1];
"foo"

"foo\nbar".match(/(.+)/s)[1];
"foo
bar"

"foo\nbar".match(/([\w\W]+)/)[1];
"foo
bar"

/Joel

Re: Some regular-expression performance hacking

From

Tom Lane

Date:

February 19, 2021, 15:26:20

"Joel Jacobson" <joel@compiler.org> writes:
> On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
>> (Having said that, I can't help noticing that a very large fraction
>> of those usages look like, eg, "[\w\W]".  It seems to me that that's
>> a very expensive and unwieldy way to spell ".".  Am I missing
>> something about what that does in Javascript?)

> I think this is a non-POSIX hack to match any character, including newlines,
> which are not included unless the "s" flag is set.

> "foo\nbar".match(/([\w\W]+)/)[1];
> "foo
> bar"

Oooh, that's very interesting.   I guess the advantage of that over using
the 's' flag is that you can have different behaviors at different places
in the same regex.

I was just wondering about this last night in fact, while hacking on
the code to get it to accept \W etc in bracket expressions.  I see that
right now, our code thinks that NLSTOP mode ('n' switch, the opposite
of 's') should cause \W \D \S to not match newline.  That seems a little
weird, not least because \S should probably be different from the other
two, and it isn't.  And now we see it'd mean that you couldn't use the 'n'
switch to duplicate Javascript's default behavior in this area.  Should we
change it?  (I wonder what Perl does.)

            regards, tom lane
