Fix regex back-references that are directly quantified with *.

The syntax "\n*", that is a backref with a * quantifier directly applied
to it, has never worked correctly in Spencer's library. This has been an
open bug in the Tcl bug tracker since 2005:
https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894

The core of the problem is in parseqatom(), which first changes "\n*" to
"\n+|" and then applies repeat() to the NFA representing the backref atom.
repeat() thinks that any arc leading into its "rp" argument is part of the
sub-NFA to be repeated. Unfortunately, since parseqatom() already created
the arc that was intended to represent the empty bypass around "\n+", this
arc gets moved too, so that it now leads into the state loop created by
repeat(). Thus, what was supposed to be an "empty" bypass gets turned into
something that represents zero or more repetitions of the NFA representing
the backref atom. In the original example, in place of
    ^([bc])\1*$
we now have something that acts like
    ^([bc])(\1+|[bc]*)$
At runtime, the branch involving the actual backref fails, as it's supposed
to, but then the other branch succeeds anyway.

We could no doubt fix this by some rearrangement of the operations in
parseqatom(), but that code is plenty ugly already, and what's more the
whole business of converting "x*" to "x+|" probably needs to go away to fix
another problem I'll mention in a moment. Instead, this patch suppresses
the *-conversion when the target is a simple backref atom, leaving the case
of m == 0 to be handled at runtime. This makes the patch in regcomp.c a
one-liner, at the cost of having to tweak cbrdissect() a little. In the
event I went a bit further than that and rewrote cbrdissect() to check all
the string-length-related conditions before it starts comparing characters.
It seems a bit stupid to possibly iterate through many copies of an
n-character backreference, only to fail at the end because the target
string's length isn't a multiple of n --- we could have found that out
before starting. The existing coding could only be a win if integer
division is hugely expensive compared to character comparison, but I don't
know of any modern machine where that might be true.

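The length-first ordering described above can be sketched roughly as follows. This is only an illustration, not the actual cbrdissect() code: the function name, its signature, and the convention that a negative max means "unbounded" are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT the real cbrdissect(): decide whether the
 * tlen bytes at target consist of k back-to-back copies of the reflen
 * bytes at ref, for some k with min <= k <= max.  min is assumed to be
 * non-negative; a negative max means "no upper bound" (an assumption of
 * this sketch).  Returns 1 on match, 0 otherwise.
 */
int
repeated_backref_matches(const char *ref, size_t reflen,
                         const char *target, size_t tlen,
                         int min, int max)
{
    size_t      count;
    size_t      i;

    /*
     * Empty target: fine if zero repetitions are allowed, or if the
     * referenced text is itself empty.
     */
    if (tlen == 0)
        return (min == 0 || reflen == 0);

    /* A non-empty target cannot be built from copies of an empty string. */
    if (reflen == 0)
        return 0;

    /* The length must be an exact multiple of the backref length. */
    if (tlen % reflen != 0)
        return 0;

    /* The implied repetition count must satisfy the quantifier. */
    count = tlen / reflen;
    if (count < (size_t) min)
        return 0;
    if (max >= 0 && count > (size_t) max)
        return 0;

    /* Only now pay for the character comparisons. */
    for (i = 0; i < count; i++)
    {
        if (memcmp(target + i * reflen, ref, reflen) != 0)
            return 0;
    }
    return 1;
}
```

With this ordering, a target such as "ababa" checked against a two-character backref is rejected by one modulo operation, without comparing any characters.
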
This does not fix all the problems with quantified back-references. In
particular, the code is still broken for back-references that appear within
a larger expression that is quantified (so that direct insertion of the
quantification limits into the BACKREF node doesn't apply). I think fixing
that will take some major surgery on the NFA code, specifically introducing
an explicit iteration node type instead of trying to transform iteration
into concatenation of modified regexps.

Back-patch to all supported branches. In HEAD, also add a regression test
case for this. (It may seem a bit silly to create a regression test file
for just one test case; but I'm expecting that we will soon import a whole
bunch of regex regression tests from Tcl, so might as well create the
infrastructure now.)
2012-02-20 06:52:33 +01:00
--
-- Regular expression tests
--
-- Don't want to have to double backslashes in regexes
set standard_conforming_strings = on;
-- Test simple quantified backrefs
select 'bbbbb' ~ '^([bc])\1*$' as t;
t
---
t
(1 row)

select 'ccc' ~ '^([bc])\1*$' as t;
t
---
t
(1 row)

select 'xxx' ~ '^([bc])\1*$' as f;
f
---
f
(1 row)

select 'bbc' ~ '^([bc])\1*$' as f;
f
---
f
(1 row)

select 'b' ~ '^([bc])\1*$' as t;
t
---
t
(1 row)

2012-02-24 07:40:18 +01:00
-- Test quantified backref within a larger expression
select 'abc abc abc' ~ '^(\w+)( \1)+$' as t;
t
---
t
(1 row)

select 'abc abd abc' ~ '^(\w+)( \1)+$' as f;
f
---
f
(1 row)

select 'abc abc abd' ~ '^(\w+)( \1)+$' as f;
f
---
f
(1 row)

select 'abc abc abc' ~ '^(.+)( \1)+$' as t;
t
---
t
(1 row)

select 'abc abd abc' ~ '^(.+)( \1)+$' as f;
f
---
f
(1 row)

select 'abc abc abd' ~ '^(.+)( \1)+$' as f;
f
---
f
(1 row)

2012-05-24 19:56:16 +02:00
-- Test some cases that crashed in 9.2beta1 due to pmatch[] array overrun
select substring('asd TO foo' from ' TO (([a-z0-9._]+|"([^"]+|"")+")+)');
substring
-----------
foo
(1 row)

select substring('a' from '((a))+');
substring
-----------
a
(1 row)

select substring('a' from '((a)+)');
substring
-----------
a
(1 row)

2016-08-18 00:32:56 +02:00
-- Test regexp_match()
select regexp_match('abc', '');
regexp_match
--------------
{""}
(1 row)

select regexp_match('abc', 'bc');
regexp_match
--------------
{bc}
(1 row)

select regexp_match('abc', 'd') is null;
?column?
----------
t
(1 row)

select regexp_match('abc', '(B)(c)', 'i');
regexp_match
--------------
{b,c}
(1 row)

select regexp_match('abc', 'Bd', 'ig'); -- error
ERROR: regexp_match does not support the global option
HINT: Use the regexp_matches function instead.

Implement lookbehind constraints in our regular-expression engine.

A lookbehind constraint is like a lookahead constraint in that it consumes
no text; but it checks for existence (or nonexistence) of a match *ending*
at the current point in the string, rather than one *starting* at the
current point. This is a long-requested feature since it exists in many
other regex libraries, but Henry Spencer had never got around to
implementing it in the code we use.

Just making it work is actually pretty trivial; but naive copying of the
logic for lookahead constraints leads to code that often spends O(N^2) time
to scan an N-character string, because we have to run the match engine
from string start to the current probe point each time the constraint is
checked. In typical use-cases a lookbehind constraint will be written at
the start of the regex and hence will need to be checked at every character
--- so O(N^2) work overall. To fix that, I introduced a third copy of the
core DFA matching loop, paralleling the existing longest() and shortest()
loops. This version, matchuntil(), can suspend and resume matching given
a couple of pointers' worth of storage space. So we need only run it
across the string once, stopping at each interesting probe point and then
resuming to advance to the next one.

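The suspend-and-resume idea can be illustrated with a toy scanner. This is a sketch under stated assumptions, not the real matchuntil(): the struct, the function names, and the hard-wired three-state DFA (which recognizes "any text ending in ab", roughly what a lookbehind such as (?<=ab) has to check) are all invented for the example.

```c
#include <stddef.h>

/*
 * Hypothetical resumable scanner, NOT the real matchuntil().  The saved
 * state is just a position and a DFA state number.
 */
typedef struct
{
    const char *last;           /* how far we have scanned so far */
    int         state;          /* DFA state after consuming up to last */
} ResumableScan;

void
scan_init(ResumableScan *scan, const char *start)
{
    scan->last = start;
    scan->state = 0;
}

/* One DFA step: state 1 = "just saw a", state 2 = "just saw ab". */
int
dfa_step(int state, char c)
{
    if (c == 'a')
        return 1;
    if (c == 'b' && state == 1)
        return 2;
    return 0;
}

/*
 * Report whether some match ends exactly at probe.  Probes must be
 * passed in nondecreasing order.  Each call resumes where the previous
 * one stopped, so checking every position of an N-character string
 * costs O(N) in total rather than O(N^2).
 */
int
scan_until(ResumableScan *scan, const char *probe)
{
    while (scan->last < probe)
    {
        scan->state = dfa_step(scan->state, *scan->last);
        scan->last++;
    }
    return scan->state == 2;
}
```

A caller would initialize one ResumableScan per search and call scan_until() at each candidate position in left-to-right order; restarting from the string start for every probe is what produces the quadratic behavior described above.
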
I also put in an optimization that simplifies one-character lookahead and
lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
constraints, which already existed in the engine. This avoids the overhead
of the LACON machinery entirely for these rather common cases.

The net result is that lookbehind constraints run a factor of three or so
slower than Perl's for multi-character constraints, but faster than Perl's
for one-character constraints ... and they work fine for variable-length
constraints, which Perl gives up on entirely. So that's not bad from a
competitive perspective, and there's room for further optimization if
anyone cares. (In reality, raw scan rate across a large input string is
probably not that big a deal for Postgres usage anyway; so I'm happy if
it's linear.)
2015-10-31 00:14:19 +01:00
-- Test lookahead constraints
select regexp_matches('ab', 'a(?=b)b*');
regexp_matches
----------------
{ab}
(1 row)

select regexp_matches('a', 'a(?=b)b*');
regexp_matches
----------------
(0 rows)

select regexp_matches('abc', 'a(?=b)b*(?=c)c*');
regexp_matches
----------------
{abc}
(1 row)

select regexp_matches('ab', 'a(?=b)b*(?=c)c*');
regexp_matches
----------------
(0 rows)

select regexp_matches('ab', 'a(?!b)b*');
regexp_matches
----------------
(0 rows)

select regexp_matches('a', 'a(?!b)b*');
regexp_matches
----------------
{a}
(1 row)

select regexp_matches('b', '(?=b)b');
regexp_matches
----------------
{b}
(1 row)

select regexp_matches('a', '(?=b)b');
regexp_matches
----------------
(0 rows)

-- Test lookbehind constraints
select regexp_matches('abb', '(?<=a)b*');
regexp_matches
----------------
{bb}
(1 row)

select regexp_matches('a', 'a(?<=a)b*');
regexp_matches
----------------
{a}
(1 row)

select regexp_matches('abc', 'a(?<=a)b*(?<=b)c*');
regexp_matches
----------------
{abc}
(1 row)

select regexp_matches('ab', 'a(?<=a)b*(?<=b)c*');
regexp_matches
----------------
{ab}
(1 row)

select regexp_matches('ab', 'a*(?<!a)b*');
regexp_matches
----------------
{""}
(1 row)

select regexp_matches('ab', 'a*(?<!a)b+');
regexp_matches
----------------
(0 rows)

select regexp_matches('b', 'a*(?<!a)b+');
regexp_matches
----------------
{b}
(1 row)

select regexp_matches('a', 'a(?<!a)b*');
regexp_matches
----------------
(0 rows)

select regexp_matches('b', '(?<=b)b');
regexp_matches
----------------
(0 rows)

select regexp_matches('foobar', '(?<=f)b+');
regexp_matches
----------------
(0 rows)

select regexp_matches('foobar', '(?<=foo)b+');
regexp_matches
----------------
{b}
(1 row)

select regexp_matches('foobar', '(?<=oo)b+');
regexp_matches
----------------
{b}
(1 row)

-- Test optimization of single-chr-or-bracket-expression lookaround constraints
select 'xz' ~ 'x(?=[xy])';
?column?
----------
f
(1 row)

select 'xy' ~ 'x(?=[xy])';
?column?
----------
t
(1 row)

select 'xz' ~ 'x(?![xy])';
?column?
----------
t
(1 row)

select 'xy' ~ 'x(?![xy])';
?column?
----------
f
(1 row)

select 'x' ~ 'x(?![xy])';
?column?
----------
t
(1 row)

select 'xyy' ~ '(?<=[xy])yy+';
?column?
----------
t
(1 row)

select 'zyy' ~ '(?<=[xy])yy+';
?column?
----------
f
(1 row)

select 'xyy' ~ '(?<![xy])yy+';
?column?
----------
f
(1 row)

select 'zyy' ~ '(?<![xy])yy+';
?column?
----------
t
(1 row)

2012-07-10 20:54:37 +02:00
-- Test conversion of regex patterns to indexable conditions
explain (costs off) select * from pg_proc where proname ~ 'abc';
QUERY PLAN
-----------------------------------
Seq Scan on pg_proc
Filter: (proname ~ 'abc'::text)
(2 rows)

explain (costs off) select * from pg_proc where proname ~ '^abc';
QUERY PLAN
----------------------------------------------------------------------
Index Scan using pg_proc_proname_args_nsp_index on pg_proc
Index Cond: ((proname >= 'abc'::name) AND (proname < 'abd'::name))
Filter: (proname ~ '^abc'::text)
(3 rows)

explain (costs off) select * from pg_proc where proname ~ '^abc$';
QUERY PLAN
------------------------------------------------------------
Index Scan using pg_proc_proname_args_nsp_index on pg_proc
Index Cond: (proname = 'abc'::name)
Filter: (proname ~ '^abc$'::text)
(3 rows)

explain (costs off) select * from pg_proc where proname ~ '^abcd*e';
QUERY PLAN
----------------------------------------------------------------------
Index Scan using pg_proc_proname_args_nsp_index on pg_proc
Index Cond: ((proname >= 'abc'::name) AND (proname < 'abd'::name))
Filter: (proname ~ '^abcd*e'::text)
(3 rows)

explain (costs off) select * from pg_proc where proname ~ '^abc+d';
QUERY PLAN
----------------------------------------------------------------------
Index Scan using pg_proc_proname_args_nsp_index on pg_proc
Index Cond: ((proname >= 'abc'::name) AND (proname < 'abd'::name))
Filter: (proname ~ '^abc+d'::text)
(3 rows)

explain (costs off) select * from pg_proc where proname ~ '^(abc)(def)';
QUERY PLAN
----------------------------------------------------------------------------
Index Scan using pg_proc_proname_args_nsp_index on pg_proc
Index Cond: ((proname >= 'abcdef'::name) AND (proname < 'abcdeg'::name))
Filter: (proname ~ '^(abc)(def)'::text)
(3 rows)

explain (costs off) select * from pg_proc where proname ~ '^(abc)$';
QUERY PLAN
------------------------------------------------------------
Index Scan using pg_proc_proname_args_nsp_index on pg_proc
Index Cond: (proname = 'abc'::name)
Filter: (proname ~ '^(abc)$'::text)
(3 rows)

explain (costs off) select * from pg_proc where proname ~ '^(abc)?d';
QUERY PLAN
----------------------------------------
Seq Scan on pg_proc
Filter: (proname ~ '^(abc)?d'::text)
(2 rows)

2015-10-19 22:54:53 +02:00
explain (costs off) select * from pg_proc where proname ~ '^abcd(x|(?=\w\w)q)';
QUERY PLAN
------------------------------------------------------------------------
Index Scan using pg_proc_proname_args_nsp_index on pg_proc
Index Cond: ((proname >= 'abcd'::name) AND (proname < 'abce'::name))
Filter: (proname ~ '^abcd(x|(?=\w\w)q)'::text)
(3 rows)

2013-03-07 17:51:03 +01:00
-- Test for infinite loop in pullback() (CVE-2007-4772)
select 'a' ~ '($|^)*';
?column?
----------
t
(1 row)

Fix regular-expression compiler to handle loops of constraint arcs.

It's possible to construct regular expressions that contain loops of
constraint arcs (that is, ^ $ AHEAD BEHIND or LACON arcs). There's no use
in fully traversing such a loop at execution, since you'd just end up in
the same NFA state without having consumed any input. Worse, such a loop
leads to infinite looping in the pullback/pushfwd stage of compilation,
because we keep pushing or pulling the same constraints around the loop
in a vain attempt to move them to the pre or post state. Such looping was
previously recognized in CVE-2007-4772; but the fix only handled the case
of trivial single-state loops (that is, a constraint arc leading back to
its source state) ... and not only that, it was incorrect even for that
case, because it broke the admittedly-not-very-clearly-stated API contract
of the pull() and push() subroutines. The first two regression test cases
added by this commit exhibit patterns that result in assertion failures
because of that (though there seem to be no ill effects in non-assert
builds). The other new test cases exhibit multi-state constraint loops;
in an unpatched build they will run until the NFA state-count limit is
exceeded.

To fix, remove the code added for CVE-2007-4772, and instead create a
general-purpose constraint-loop-breaking phase of regex compilation that
executes before we do pullback/pushfwd. Since we never need to traverse
a constraint loop fully, we can just break the loop at any chosen spot,
if we add clone states that can replicate any sequence of arc transitions
that would've traversed just part of the loop.

Also add some commentary clarifying why we have to have all these
machinations in the first place.

This class of problems has been known for some time --- we had a report
from Marc Mamin about two years ago, for example, and there are related
complaints in the Tcl bug tracker. I had discussed a fix of this kind
off-list with Henry Spencer, but didn't get around to doing something
about it until the issue was rediscovered by Greg Stark recently.

Back-patch to all supported branches.
2015-10-16 20:14:40 +02:00
-- These cases expose a bug in the original fix for CVE-2007-4772
select 'a' ~ '(^)+^';
?column?
----------
t
(1 row)

select 'a' ~ '$($$)+';
?column?
----------
t
(1 row)

-- More cases of infinite loop in pullback(), not fixed by CVE-2007-4772 fix
select 'a' ~ '($^)+';
?column?
----------
f
(1 row)

select 'a' ~ '(^$)*';
?column?
----------
t
(1 row)

select 'aa bb cc' ~ '(^(?!aa))+';
?column?
----------
f
(1 row)

select 'aa x' ~ '(^(?!aa)(?!bb)(?!cc))+';
?column?
----------
f
(1 row)

select 'bb x' ~ '(^(?!aa)(?!bb)(?!cc))+';
?column?
----------
f
(1 row)

select 'cc x' ~ '(^(?!aa)(?!bb)(?!cc))+';
?column?
----------
f
(1 row)

select 'dd x' ~ '(^(?!aa)(?!bb)(?!cc))+';
?column?
----------
t
(1 row)

2013-03-07 17:51:03 +01:00
-- Test for infinite loop in fixempties() (Tcl bugs 3604074, 3606683)
select 'a' ~ '((((((a)*)*)*)*)*)*';
?column?
----------
t
(1 row)

select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';
?column?
----------
t
(1 row)

2015-10-16 21:52:12 +02:00
-- These cases used to give too-many-states failures
select 'x' ~ 'abcd(\m)+xyz';
?column?
----------
f
(1 row)

select 'a' ~ '^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
?column?
----------
f
(1 row)

select 'x' ~ 'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
?column?
----------
f
(1 row)

select 'x' ~ 'xyz(\Y\Y)+';
?column?
----------
f
(1 row)

select 'x' ~ 'x|(?:\M)+';
?column?
----------
t
(1 row)

-- This generates O(N) states but O(N^2) arcs, so it causes problems
-- if arc count is not constrained
select 'x' ~ repeat('x*y*z*', 1000);
ERROR: invalid regular expression: regular expression is too complex

2013-07-19 03:22:37 +02:00
-- Test backref in combination with non-greedy quantifier
-- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
select 'Programmer' ~ '(\w).*?\1' as t;
t
---
t
(1 row)

select regexp_matches('Programmer', '(\w)(.*?\1)', 'g');
regexp_matches
----------------
{r,ogr}
{m,m}
(2 rows)

2014-09-24 02:25:31 +02:00
-- Test for proper matching of non-greedy iteration (bug #11478)
select regexp_matches('foo/bar/baz',
'^([^/]+?)(?:/([^/]+?))(?:/([^/]+?))?$', '');
regexp_matches
----------------
{foo,bar,baz}
(1 row)

Fix potential infinite loop in regular expression execution.

In cfindloop(), if the initial call to shortest() reports that a
zero-length match is possible at the current search start point, but then
it is unable to construct any actual match to that, it'll just loop around
with the same start point, and thus make no progress. We need to force the
start point to be advanced. This is safe because the loop over "begin"
points has already tried and failed to match starting at "close", so there
is surely no need to try that again.

This bug was introduced in commit e2bd904955e2221eddf01110b1f25002de2aaa83,
wherein we allowed continued searching after we'd run out of match
possibilities, but evidently failed to think hard enough about exactly
where we needed to search next.

Because of the way this code works, such a match failure is only possible
in the presence of backrefs --- otherwise, shortest()'s judgment that a
match is possible should always be correct. That probably explains how
come the bug has escaped detection for several years.

The actual fix is a one-liner, but I took the trouble to add/improve some
comments related to the loop logic.

After fixing that, the submitted test case "()*\1" didn't loop anymore.
But it reported failure, though it seems like it ought to match a
zero-length string; both Tcl and Perl think it does. That seems to be from
overenthusiastic optimization on my part when I rewrote the iteration match
logic in commit 173e29aa5deefd9e71c183583ba37805c8102a72: we can't just
"declare victory" for a zero-length match without bothering to set match
data for capturing parens inside the iterator node.

Per fuzz testing by Greg Stark. The first part of this is a bug in all
supported branches, and the second part is a bug since 9.2 where the
iteration rewrite happened.
2015-10-02 20:26:36 +02:00
-- Test for infinite loop in cfindloop with zero-length possible match
-- but no actual match (can only happen in the presence of backrefs)
select 'a' ~ '$()|^\1';
?column?
----------
f
(1 row)

select 'a' ~ '.. ()|\1';
?column?
----------
f
(1 row)

select 'a' ~ '()*\1';
?column?
----------
t
(1 row)

select 'a' ~ '()+\1';
?column?
----------
t
(1 row)

2015-11-07 18:43:24 +01:00
-- Error conditions
select 'xyz' ~ 'x(\w)(?=\1)'; -- no backrefs in LACONs
ERROR: invalid regular expression: invalid backreference number
select 'xyz' ~ 'x(\w)(?=(\1))';
ERROR: invalid regular expression: invalid backreference number

Fix some regex issues with out-of-range characters and large char ranges.

Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a
bad choice because it is outside the range of type "celt" (int32).
Characters approaching that limit could lead to infinite loops in logic
such as "for (c = a; c <= b; c++)" where c is of type celt but the
range bounds are chr. Such loops will work safely only if CHR_MAX+1
is representable in celt, since c must advance to beyond b before the
loop will exit.

Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe.
It's highly unlikely that Unicode will ever assign codes that high, and
none of our other backend encodings need characters beyond that either.
In addition to modifying the macro, we have to explicitly enforce character
range restrictions on the values of \u, \U, and \x escape sequences, else
the limit is trivially bypassed.

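The interaction between the loop shape and the restricted limit can be sketched as below. The typedefs are assumptions for illustration (the text only says that celt is int32; chr is taken here to be an unsigned 32-bit code), and chr_in_range() and count_range() are hypothetical helpers, not functions from the regex library.

```c
#include <stdint.h>

typedef uint32_t chr;           /* assumed: unsigned 32-bit character code */
typedef int32_t celt;           /* int32, per the description above */

#define CHR_MAX ((chr) 0x7ffffffe)

/* Hypothetical range check for the value of a \u, \U, or \x escape. */
int
chr_in_range(uint64_t value)
{
    return value <= CHR_MAX;
}

/*
 * Count the characters in [a, b] using the loop shape quoted above.
 * Because b <= CHR_MAX, the value b + 1 still fits in celt, so c can
 * step past b and the loop terminates.  Returns -1 for bounds that
 * exceed CHR_MAX.
 */
int64_t
count_range(chr a, chr b)
{
    celt        c;
    int64_t     n = 0;

    if (a > CHR_MAX || b > CHR_MAX)
        return -1;
    for (c = (celt) a; c <= (celt) b; c++)
        n++;
    return n;
}
```

With the old limit of 0xfffffffe, a bound near the top of the range has no representable successor in a 32-bit signed loop variable, which is why the loop condition could never become false.
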
Also, the code for expanding case-independent character ranges in bracket
expressions had a potential integer overflow in its calculation of the
number of characters it could generate, which could lead to allocating too
small a character vector and then overwriting memory. An attacker with the
ability to supply arbitrary regex patterns could easily cause transient DOS
via server crashes, and the possibility for privilege escalation has not
been ruled out.

Quite aside from the integer-overflow problem, the range expansion code was
unnecessarily inefficient in that it always produced a result consisting of
individual characters, abandoning the knowledge that we had a range to
start with. If the input range is large, this requires excessive memory.
Change it so that the original range is reported as-is, and then we add on
any case-equivalent characters that are outside that range. With this
approach, we can bound the number of individual characters allowed without
sacrificing much. This patch allows at most 100000 individual characters,
which I believe to be more than the number of case pairs existing in
Unicode, so that the restriction will never be hit in practice.

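The "keep the range, add only the outside case counterparts" approach might look roughly like this for plain ASCII. Everything here is a hypothetical illustration: CaseRange, expand_case_range(), and the small MAX_EXTRAS cap (standing in for the 100000 limit) are not names from the regex code, and the real range() deals with full character sets rather than ASCII.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_EXTRAS 64           /* hypothetical cap, stands in for 100000 */

typedef struct
{
    int         lo;             /* the original range, reported as-is */
    int         hi;
    int         extras[MAX_EXTRAS]; /* case counterparts outside [lo, hi] */
    size_t      nextras;
} CaseRange;

/*
 * Fill *out for the ASCII range [lo, hi] (0 <= lo <= hi <= 127 assumed).
 * Returns 0 on success, or -1 if more than MAX_EXTRAS individual
 * characters would be needed.
 */
int
expand_case_range(int lo, int hi, CaseRange *out)
{
    int         c;

    out->lo = lo;
    out->hi = hi;
    out->nextras = 0;
    for (c = lo; c <= hi; c++)
    {
        int         other;

        if (islower(c))
            other = toupper(c);
        else if (isupper(c))
            other = tolower(c);
        else
            continue;           /* no case counterpart */
        if (other >= lo && other <= hi)
            continue;           /* already covered by the range itself */
        if (out->nextras >= MAX_EXTRAS)
            return -1;          /* bounded, so no overflow of extras[] */
        out->extras[out->nextras++] = other;
    }
    return 0;
}
```

For a range such as A through z the result carries no extra characters at all, since every counterpart already lies inside the range; the cost in individual characters is bounded by the number of case pairs, not by the width of the range.
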
It's still possible for range() to take a while given a large character code
range, so also add statement-cancel detection to its loop. The downstream
function dovec() also lacked cancel detection, and could take a long time
given a large output from range().

Per fuzz testing by Greg Stark. Back-patch to all supported branches.

Security: CVE-2016-0773
2016-02-08 16:25:40 +01:00
select 'a' ~ '\x7fffffff'; -- invalid chr code
ERROR: invalid regular expression: invalid escape \ sequence