Обсуждение: Pathological regexp match

Поиск
Список
Период
Сортировка

Pathological regexp match

От
Michael Glaesemann
Дата:
We came across a regexp that takes very much longer than expected.

PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit

SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity
?column?
----------t
(1 row)

Time: 90525.148 ms

The full query is below.

The same match in perl takes less than 0.01 seconds on the same hardware.

#!/bin/env perl
use warnings;
use strict;

my $sample = 'ooo...'; # omitted for email brevity

if ($sample =~ /Z(Q)[^Q]*A.*?(\1)/) {   print 'matches';
}
else {   print 'does not match';
}

This is a simplified version of a match that finally finished after 18 hours.

Given the nearly 4 orders of magnitude difference between the Perl script and the Postgres version, is there something
thatcould be improved in the Postgres regex engine? 

Cheers,

Michael Glaesemann
michael.glaesemann@myyearbook.com

SELECT
'ooooooooooQooQoQooooooooQoooooooooooooooooooooooooooooooooooooooQoooooQoooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooooooQoQoooooooooooooooooooooooooooooooooooooQoooooooooooooooooooooooooooooooooooQoooooooQooooooQooooZQoooooooooooAooooQoooooooQooooooooooooooooooooooQoooooooQooooooooooooooooooooooQoooooQoooooooooooAooooooooooooooooooooooooooooooooooooooooooooQooooooooQoQooooooooooooooooooooooooooQoQooooooQoooooooQoooooooQoooooooQooooooooooooooooooooooooooooooooooooooooooZQoooooooooooooooooQooQoQoooooooQooooooooooooooooooooooooooooooooooooooooooQoooooQoooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQoQooooooooooooooooZQoooooooooooooooooQooQoQooooooooooooQoQoooooooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooQoooooooooooooooooooooooooooooooooooooooQoooooQooooooooooooooooooooooooooooooooooooooooooooooQooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooooooQoQooooooooooooooooooooooooooooooZQoooooooooooooQooQoQooooooooooQoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooQoooooooooooooooooooooooooQoooooooQooooooooooooooooooooooooooooooooooooooQooooooooQoQoooooooooooooooooooooooooooooZQooooooooooooQooQoQooooooooooooooooooooooooooooooooooooooooooooooooooooooooZQooooooooooooooooooooooooooooooQooQoQoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQooooZQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooooQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooQooooooooooQoooooooQoooooooooooQoooooooooooooo'
~$r$Z(Q)[^Q]*A.*?(\1)$r$; 



Re: Pathological regexp match

От
Alvaro Herrera
Дата:
Hi Michael,

Michael Glaesemann wrote:
> We came across a regexp that takes very much longer than expected.
> 
> PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit
> 
> SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity

The ? after .* is pointless.  If you remove it, the query returns
immediately.

(There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Pathological regexp match

От
Michael Glaesemann
Дата:
On Jan 28, 2010, at 21:59 , Alvaro Herrera wrote:

> Hi Michael,
>
> Michael Glaesemann wrote:
>> We came across a regexp that takes very much longer than expected.
>>
>> PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc  
>> (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit
>>
>> SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email  
>> brevity
>
> The ? after .* is pointless.

Interesting. I would expect that *? would be the non-greedy version of  
*, meaning match up to the first \1 (in this case the first Q  
following A), rather than as much as possible.

For example, in Perl:
$ perl -e " if ('oooZQoooAoooQooQooQooo' =~ /Z(Q)[^Q]*A.*(\1)/)  
{ print \$&; } else { print 'NO'; }" && echo
ZQoooAoooQooQooQ
$ perl -e " if ('oooZQoooAoooQooQooQooo' =~ /Z(Q)[^Q]*A.*?(\1)/)  
{ print \$&; } else { print 'NO'; }" && echo
ZQoooAoooQ

If I'm reading the docs right, Postgres does support non-greedy * as *?:

<http://www.postgresql.org/docs/8.4/interactive/functions-matching.html#POSIX-QUANTIFIERS-TABLE >

However, as you point out, Postgres doesn't appear to take this into  
account:

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q) 
[^Q]*A.*(\2))$r$, $s$X$s$); regexp_replace
---------------- oooXooo
(1 row)

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q) 
[^Q]*A.*?(\2))$r$, $s$X$s$); regexp_replace
---------------- oooXooo
(1 row)

Michael Glaesemann
michael.glaesemann@myyearbook.com





Re: Pathological regexp match

От
Alvaro Herrera
Дата:
Michael Glaesemann wrote:
> 
> On Jan 28, 2010, at 21:59 , Alvaro Herrera wrote:
> 
> >Hi Michael,
> >
> >Michael Glaesemann wrote:
> >>We came across a regexp that takes very much longer than expected.
> >>
> >>PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC
> >>gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit
> >>
> >>SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email
> >>brevity
> >
> >The ? after .* is pointless.
> 
> Interesting. I would expect that *? would be the non-greedy version
> of *, meaning match up to the first \1 (in this case the first Q
> following A), rather than as much as possible.

Huh, you are right, *? is the non-greedy version.  I keep forgetting
those.  Note that they only work if you have regex_flavor set to
advanced, though (which is the default).


> However, as you point out, Postgres doesn't appear to take this into
> account:
> 
> postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
> [^Q]*A.*(\2))$r$, $s$X$s$);
>  regexp_replace
> ----------------
>  oooXooo
> (1 row)
> 
> postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
> [^Q]*A.*?(\2))$r$, $s$X$s$);
>  regexp_replace
> ----------------
>  oooXooo
> (1 row)

Hmm, that's strange ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Pathological regexp match

От
Alvaro Herrera
Дата:
Michael Glaesemann wrote:

> However, as you point out, Postgres doesn't appear to take this into
> account:
> 
> postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
> [^Q]*A.*(\2))$r$, $s$X$s$);
>  regexp_replace
> ----------------
>  oooXooo
> (1 row)
> 
> postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
> [^Q]*A.*?(\2))$r$, $s$X$s$);
>  regexp_replace
> ----------------
>  oooXooo
> (1 row)

I think the reason for this is that the first * is greedy and thus the
entire expression is considered greedy.  The fact that you've made the
second * non-greedy does not ungreedify the RE ... Note the docs say:
The above rules associate greediness attributes not only withindividual quantified atoms, but with branches and entire
REsthatcontain quantified atoms. What that means is that thematching is done in such a way that the branch, or whole
RE,matchesthe longest or shortest possible substring as a whole.
 

It's late here so I'm not sure if this is what you're looking for:

alvherre=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)[^Q]*?A.*(\2))$r$, $s$X$s$);regexp_replace 
----------------oooXooQooQooo
(1 fila)

(Obviously the non-greediness has moved somewhere else) :-(

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Pathological regexp match

От
Michael Glaesemann
Дата:
On Jan 28, 2010, at 23:21 , Alvaro Herrera wrote:

> I think the reason for this is that the first * is greedy and thus the
> entire expression is considered greedy.  The fact that you've made the
> second * non-greedy does not ungreedify the RE ... Note the docs say:
>
>     The above rules associate greediness attributes not only with
>     individual quantified atoms, but with branches and entire REs
>     that contain quantified atoms. What that means is that the
>     matching is done in such a way that the branch, or whole RE,
>     matches the longest or shortest possible substring as a whole.

Interesting. Thanks for pointing out this section of the docs. I  
wasn't aware of this twist.

> It's late here so I'm not sure if this is what you're looking for:

I'm not actually looking for a regexp that works: I was able to  
accomplish the task I had at hand with a different regexp. I'm just  
reporting the particular unexpected nastiness we ran into. :)

Michael Glaesemann
michael.glaesemann@myyearbook.com





Re: Pathological regexp match

От
Magnus Hagander
Дата:
2010/1/29 Alvaro Herrera <alvherre@commandprompt.com>:
> Hi Michael,
>
> Michael Glaesemann wrote:
>> We came across a regexp that takes very much longer than expected.
>>
>> PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit
>>
>> SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity
>
> The ? after .* is pointless.  If you remove it, the query returns
> immediately.
>
> (There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)

Incidentally, I ran across the exact same issue with a non-greedy
regexp with a client earlier this week, and put on my TODO to figure
out a good place to stick a check for interrupts. Does this mean I
don't have to, because you're on it? ;)


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Pathological regexp match

От
Alvaro Herrera
Дата:
Magnus Hagander wrote:
> 2010/1/29 Alvaro Herrera <alvherre@commandprompt.com>:

> > (There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)
>
> Incidentally, I ran across the exact same issue with a non-greedy
> regexp with a client earlier this week, and put on my TODO to figure
> out a good place to stick a check for interrupts. Does this mean I
> don't have to, because you're on it? ;)

No, sorry :-(

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Pathological regexp match

От
Tom Lane
Дата:
Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:
> We came across a regexp that takes very much longer than expected.

I poked into this a little bit.  What seems to be happening is that the
use of non-greedy quantifiers plus backreferences turns off most of the
optimization that the regexp engine usually does, leaving the RE as a
tree of possibilities that is explored in a fairly dumb fashion.  In
particular, there is a loop in crevdissect() that tries to locate a
feasible division point between two concatenated sub-patterns, and
for each try, a recursive call to crevdissect() tries to locate a
feasible division point between *its* two sub-patterns, resulting in
O(N^2) runtime.  With a not very pleasant constant factor, too, because
of repeated startup/shutdown costs for the DFA matching engine.

I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second.  This brings the
runtime down from minutes to sub-second on my machine.  However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?

BTW, I filed the problem upstream with the Tcl project
https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
but I'm not going to hold my breath waiting for useful advice from
them.  Since Henry Spencer dropped off the net, there doesn't seem
to be anybody there who understands that code much better than we do.

Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...

            regards, tom lane

Index: src/backend/regex/regexec.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/regex/regexec.c,v
retrieving revision 1.27
diff -c -r1.27 regexec.c
*** src/backend/regex/regexec.c    15 Oct 2005 02:49:24 -0000    1.27
--- src/backend/regex/regexec.c    30 Jan 2010 06:53:13 -0000
***************
*** 804,820 ****
      for (;;)
      {
          /* try this midpoint on for size */
!         er = cdissect(v, t->left, begin, mid);
!         if (er == REG_OKAY &&
!             longest(v, d2, mid, end, (int *) NULL) == end &&
!             (er = cdissect(v, t->right, mid, end)) ==
!             REG_OKAY)
!             break;                /* NOTE BREAK OUT */
!         if (er != REG_OKAY && er != REG_NOMATCH)
          {
!             freedfa(d);
!             freedfa(d2);
!             return er;
          }

          /* that midpoint didn't work, find a new one */
--- 804,821 ----
      for (;;)
      {
          /* try this midpoint on for size */
!         if (longest(v, d2, mid, end, (int *) NULL) == end)
          {
!             er = cdissect(v, t->left, begin, mid);
!             if (er == REG_OKAY &&
!                 (er = cdissect(v, t->right, mid, end)) == REG_OKAY)
!                 break;                /* NOTE BREAK OUT */
!             if (er != REG_OKAY && er != REG_NOMATCH)
!             {
!                 freedfa(d);
!                 freedfa(d2);
!                 return er;
!             }
          }

          /* that midpoint didn't work, find a new one */
***************
*** 904,920 ****
      for (;;)
      {
          /* try this midpoint on for size */
!         er = cdissect(v, t->left, begin, mid);
!         if (er == REG_OKAY &&
!             longest(v, d2, mid, end, (int *) NULL) == end &&
!             (er = cdissect(v, t->right, mid, end)) ==
!             REG_OKAY)
!             break;                /* NOTE BREAK OUT */
!         if (er != REG_OKAY && er != REG_NOMATCH)
          {
!             freedfa(d);
!             freedfa(d2);
!             return er;
          }

          /* that midpoint didn't work, find a new one */
--- 905,922 ----
      for (;;)
      {
          /* try this midpoint on for size */
!         if (longest(v, d2, mid, end, (int *) NULL) == end)
          {
!             er = cdissect(v, t->left, begin, mid);
!             if (er == REG_OKAY &&
!                 (er = cdissect(v, t->right, mid, end)) == REG_OKAY)
!                 break;                /* NOTE BREAK OUT */
!             if (er != REG_OKAY && er != REG_NOMATCH)
!             {
!                 freedfa(d);
!                 freedfa(d2);
!                 return er;
!             }
          }

          /* that midpoint didn't work, find a new one */

Re: Pathological regexp match

От
Magnus Hagander
Дата:
On Sat, Jan 30, 2010 at 08:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:
>> We came across a regexp that takes very much longer than expected.
>
> I poked into this a little bit.  What seems to be happening is that the
> use of non-greedy quantifiers plus backreferences turns off most of the
> optimization that the regexp engine usually does, leaving the RE as a
> tree of possibilities that is explored in a fairly dumb fashion.  In
> particular, there is a loop in crevdissect() that tries to locate a
> feasible division point between two concatenated sub-patterns, and
> for each try, a recursive call to crevdissect() tries to locate a
> feasible division point between *its* two sub-patterns, resulting in
> O(N^2) runtime.  With a not very pleasant constant factor, too, because
> of repeated startup/shutdown costs for the DFA matching engine.
>
> I found a possible patch that seems to improve matters: AFAICS the DFA
> matching is independent of the checking that cdissect() and friends do,
> so we can apply that check first instead of second.  This brings the
> runtime down from minutes to sub-second on my machine.  However I don't
> have a whole lot of faith either in the correctness of this change, or
> that it might not make some other cases slower instead of faster.
> Has anyone got a reasonably messy collection of regexps they'd like
> to try this patch on?

I only have the one, so I don't think I can help all that much with that.

> BTW, I filed the problem upstream with the Tcl project
> https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
> but I'm not going to hold my breath waiting for useful advice from
> them.  Since Henry Spencer dropped off the net, there doesn't seem
> to be anybody there who understands that code much better than we do.
>
> Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
> in there ...

Yeah. I have zero experience around that code, so if oyu have at least
some, I'd much appreciate it if you (or someone who does) could look
at it. Likely to cause a lot less breakage than me :D

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Pathological regexp match

От
Tom Lane
Дата:
I wrote:
> I found a possible patch that seems to improve matters: AFAICS the DFA
> matching is independent of the checking that cdissect() and friends do,
> so we can apply that check first instead of second.  This brings the
> runtime down from minutes to sub-second on my machine.  However I don't
> have a whole lot of faith either in the correctness of this change, or
> that it might not make some other cases slower instead of faster.
> Has anyone got a reasonably messy collection of regexps they'd like
> to try this patch on?

The Tcl folk accepted that patch, so I went ahead and applied it to
our code.  It would still be a good idea for us to do any testing we
can on it, though.

> Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
> in there ...

I looked at this a bit and decided that it would be messier than seems
justified in the absence of known problems.  The regex code uses malloc
not palloc for transient data structures, so we can't just stick a
CHECK_FOR_INTERRUPTS() into some inner loop hotspot --- throwing a
longjmp would mean a permanent memory leak.  I looked at integrating
into the error mechanism it does have, but it turns out that that's
not particularly designed to force immediate exit on failure either;
they just set a flag bit that will be tested whenever control does
return from the match.  I think that a good fix would require first
changing the mallocs to pallocs, then add CHECK_FOR_INTERRUPTS.
But that's not something I have time to mess with at the moment.
        regards, tom lane


Re: Pathological regexp match

От
Michael Glaesemann
Дата:
On Jan 31, 2010, at 22:14 , Tom Lane wrote:

> The Tcl folk accepted that patch, so I went ahead and applied it to
> our code.  It would still be a good idea for us to do any testing we
> can on it, though.

I applied the patch and ran both the test query I submitted as well as  
original problematic query that triggered the report, and it runs much  
faster. Thanks for the fix!

Michael Glaesemann
michael.glaesemann@myyearbook.com





Re: Pathological regexp match

От
Magnus Hagander
Дата:
2010/2/1 Michael Glaesemann <michael.glaesemann@myyearbook.com>:
>
> On Jan 31, 2010, at 22:14 , Tom Lane wrote:
>
>> The Tcl folk accepted that patch, so I went ahead and applied it to
>> our code.  It would still be a good idea for us to do any testing we
>> can on it, though.
>
> I applied the patch and ran both the test query I submitted as well as original problematic query that triggered the
report,and it runs much faster. Thanks for the fix! 

I did the same, and it does not help in my case. FWIW, the regexp I'm
matching is:
<pre .*?>(.*?)</pre>

(yes, the production system has already been fixed to use a smarter
regexp that solves the same problem)

The text is about 180Kb. PostgreSQL takes ~40 seconds without the
patch, ~36 seconds with it, to extract the match from it. Perl takes
0.016 seconds.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Pathological regexp match

От
"David E. Wheeler"
Дата:
On Feb 8, 2010, at 5:15 AM, Magnus Hagander wrote:

> The text is about 180Kb. PostgreSQL takes ~40 seconds without the
> patch, ~36 seconds with it, to extract the match from it. Perl takes
> 0.016 seconds.

Obviously we need to support Perl regular expressions in core. Not PCRE, but Perl. ;-P

Best,

David



Re: Pathological regexp match

От
Joachim Wieland
Дата:
On Mon, Feb 8, 2010 at 6:07 PM, David E. Wheeler <david@kineticode.com> wrote:
> On Feb 8, 2010, at 5:15 AM, Magnus Hagander wrote:
>
>> The text is about 180Kb. PostgreSQL takes ~40 seconds without the
>> patch, ~36 seconds with it, to extract the match from it. Perl takes
>> 0.016 seconds.
>
> Obviously we need to support Perl regular expressions in core. Not PCRE, but Perl. ;-P

FWIW, PCRE is BSD licensed, so we could actually include it... Would
be interesting to see how it performs on the pattern at hand.


Joachim