Miscellaneous cleanup of regular-expression compiler.

Revert our previous addition of "all" flags to copyins() and copyouts();
they're no longer needed, and were never anything but an unsightly hack.

Improve a couple of infelicities in the REG_DEBUG code for dumping
the NFA data structure, including adding code to count the total
number of states and arcs.

Add a couple of missed error checks.

Add some more documentation in the README file, and some regression tests
illustrating cases that exceeded the state-count limit and/or took
unreasonable amounts of time before this set of patches.

Back-patch to all supported branches.
Tom Lane 2015-10-16 15:52:12 -04:00
parent 538b3b8b35
commit afdfcd3f76
5 changed files with 153 additions and 54 deletions

@ -76,11 +76,10 @@ relates to what you'll see in the code. Here's what really happens:
of states approximately proportional to the length of the regexp.
* The NFA is then optimized into a "compact NFA" representation, which is
basically the same data but without fields that are not going to be needed
at runtime. We do a little bit of cleanup too, such as removing
unreachable states that might be created as a result of the rather naive
transformation done by initial parsing. The cNFA representation is what
is passed from regcomp to regexec.
basically the same idea but without fields that are not going to be needed
at runtime. It is simplified too: the compact format only allows "plain"
and "LACON" arc types. The cNFA representation is what is passed from
regcomp to regexec.
* Unlike traditional NFA-based regex engines, we do not execute directly
from the NFA representation, as that would require backtracking and so be
@ -139,12 +138,13 @@ a possible division of the input string that allows its two child nodes to
each match their part of the string (and although this specific case can
only succeed when the division is at the middle, the code does not know
that, nor would it be true in general). However, we can first run the DFA
and quickly reject any input that doesn't contain two a's and some number
of b's and c's. If the DFA doesn't match, there is no need to recurse to
the two child nodes for each possible string division point. In many
cases, this prefiltering makes the search run much faster than a pure NFA
engine could do. It is this behavior that justifies using the phrase
"hybrid DFA/NFA engine" to describe Spencer's library.
and quickly reject any input that doesn't start with an "a" and contain
one more "a" plus some number of b's and c's. If the DFA doesn't match,
there is no need to recurse to the two child nodes for each possible
string division point. In many cases, this prefiltering makes the search
run much faster than a pure NFA engine could do. It is this behavior that
justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
library.
Colors and colormapping
@ -296,3 +296,76 @@ character classes are somehow processed "symbolically" without making a
full expansion of their contents at parse time. This would mean that we'd
have to be ready to call iswalpha() at runtime, but if that only happens
for high-code-value characters, it shouldn't be a big performance hit.
Detailed semantics of an NFA
----------------------------
When trying to read dumped-out NFAs, it's helpful to know these facts:
State 0 (additionally marked with "@" in dumpnfa's output) is always the
goal state, and state 1 (additionally marked with ">") is the start state.
(The code refers to these as the post state and pre state respectively.)
The possible arc types are:
PLAIN arcs, which specify matching of any character of a given "color"
(see above). These are dumped as "[color_number]->to_state".
EMPTY arcs, which specify a no-op transition to another state. These
are dumped as "->to_state".
AHEAD constraints, which represent a "next character must be of this
color" constraint. AHEAD differs from a PLAIN arc in that the input
character is not consumed when crossing the arc. These are dumped as
">color_number>->to_state".
BEHIND constraints, which represent a "previous character must be of
this color" constraint, which likewise consumes no input. These are
dumped as "<color_number<->to_state".
'^' arcs, which specify a beginning-of-input constraint. These are
dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
beginning-of-line constraints respectively.
'$' arcs, which specify an end-of-input constraint. These are dumped
as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
constraints respectively.
LACON constraints, which represent "(?=re)" and "(?!re)" constraints,
i.e. the input starting at this point must match (or not match) a
given sub-RE, but the matching input is not consumed. These are
dumped as ":subtree_number:->to_state".
If you see anything else (especially any question marks) in the display of
an arc, it's dumpnfa() trying to tell you that there's something fishy
about the arc; see the source code.
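
As a reading aid only, the dump notation listed above can be sketched as a small formatter. This is not the real dumparc() code: the enum, the function name format_arc, and the "?" fallback are all hypothetical, and only the output shapes are taken from the README text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical arc kinds named after the README's list; not the real headers. */
enum arc_kind
{
    ARC_PLAIN, ARC_EMPTY, ARC_AHEAD, ARC_BEHIND,
    ARC_CARET, ARC_DOLLAR, ARC_LACON
};

/*
 * Format one arc in the notation the README describes for dumpnfa output.
 * "n" is the color number (PLAIN, AHEAD, BEHIND), the 0/1 flag of a ^ or $
 * arc, or the LACON subtree number; it is ignored for EMPTY arcs.
 * Returns what snprintf returns (the number of characters needed).
 */
static int
format_arc(char *buf, size_t len, enum arc_kind kind, int n, int to_state)
{
    assert(buf != NULL && len > 0);

    switch (kind)
    {
        case ARC_PLAIN:
            return snprintf(buf, len, "[%d]->%d", n, to_state);
        case ARC_EMPTY:
            return snprintf(buf, len, "->%d", to_state);
        case ARC_AHEAD:
            return snprintf(buf, len, ">%d>->%d", n, to_state);
        case ARC_BEHIND:
            return snprintf(buf, len, "<%d<->%d", n, to_state);
        case ARC_CARET:
            return snprintf(buf, len, "^%d->%d", n, to_state);
        case ARC_DOLLAR:
            return snprintf(buf, len, "$%d->%d", n, to_state);
        case ARC_LACON:
            return snprintf(buf, len, ":%d:->%d", n, to_state);
    }
    /* Assumption: flag an unknown kind with a question mark. */
    return snprintf(buf, len, "?->%d", to_state);
}
```

For instance, calling format_arc with ARC_AHEAD, color 2 and target state 5 fills the buffer with ">2>->5", which is the shape the list above gives for AHEAD constraints.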
The regex executor can only handle PLAIN and LACON transitions. The regex
optimize() function is responsible for transforming the parser's output
to get rid of all the other arc types. In particular, ^ and $ arcs that
are not dropped as impossible will always end up adjacent to the pre or
post state respectively, and then will be converted into PLAIN arcs that
mention the special "colors" for BOS, BOL, EOS, or EOL.
To decide whether a thus-transformed NFA matches a given substring of the
input string, the executor essentially follows these rules:
1. Start the NFA "looking at" the character *before* the given substring,
or if the substring is at the start of the input, prepend an imaginary BOS
character instead.
2. Run the NFA until it has consumed the character *after* the given
substring, or an imaginary following EOS character if the substring is at
the end of the input.
3. If the NFA is (or can be) in the goal state at this point, it matches.
So one can mentally execute an untransformed NFA by taking ^ and $ as
ordinary constraints that match at start and end of input; but plain
arcs out of the start state should be taken as matches for the character
before the target substring, and similarly, plain arcs leading to the
post state are matches for the character after the target substring.
This definition is necessary to support regexes that begin or end with
constraints such as \m and \M, which imply requirements on the adjacent
character if any. NFAs for simple unanchored patterns will usually have
pre-state outarcs for all possible character colors as well as BOS and
BOL, and post-state inarcs for all possible character colors as well as
EOS and EOL, so that the executor's behavior will work.
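
The three executor rules above can be illustrated with a toy simulation. This is a minimal sketch and not the real regexec code: the four "colors", the arc table (a hand-built NFA for the unanchored pattern "a"), and the names toy_step and toy_matches are all assumptions made for illustration. Only the state numbering (post state 0, pre state 1) and the three rules come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical colors: 'a', any other character, and the BOS/EOS pseudo-colors. */
enum { COL_A, COL_OTHER, COL_BOS, COL_EOS };

struct toy_arc
{
    int from;
    int color;
    int to;
};

/* Toy NFA for "a": pre state is 1, post (goal) state is 0, as in the README. */
static const struct toy_arc toy_arcs[] = {
    {1, COL_A, 2}, {1, COL_OTHER, 2}, {1, COL_BOS, 2},  /* character before */
    {2, COL_A, 3},                                      /* the 'a' itself */
    {3, COL_A, 0}, {3, COL_OTHER, 0}, {3, COL_EOS, 0},  /* character after */
};

/* Advance a bitmask of active states across one input color. */
static unsigned
toy_step(unsigned states, int color)
{
    unsigned next = 0;
    size_t i;

    for (i = 0; i < sizeof toy_arcs / sizeof toy_arcs[0]; i++)
        if ((states & (1u << toy_arcs[i].from)) && toy_arcs[i].color == color)
            next |= 1u << toy_arcs[i].to;
    return next;
}

static int
toy_color(char c)
{
    return (c == 'a') ? COL_A : COL_OTHER;
}

/* Does the toy NFA match input[start..end)?  Follows rules 1 to 3 above. */
static int
toy_matches(const char *input, size_t start, size_t end)
{
    size_t len = strlen(input);
    unsigned states = 1u << 1;      /* begin in the pre state */
    size_t i;

    assert(start <= end && end <= len);

    /* Rule 1: consume the character before the substring, or BOS. */
    states = toy_step(states, start == 0 ? COL_BOS : toy_color(input[start - 1]));
    for (i = start; i < end; i++)
        states = toy_step(states, toy_color(input[i]));
    /* Rule 2: consume the character after the substring, or EOS. */
    states = toy_step(states, end == len ? COL_EOS : toy_color(input[end]));
    /* Rule 3: it matches if the goal state is active. */
    return (states & (1u << 0)) != 0;
}
```

With this table, toy_matches("xay", 1, 2) succeeds because 'x' is consumed by a pre-state outarc and 'y' by a post-state inarc, while toy_matches("aa", 0, 2) fails because no arc leaves the goal state to consume the trailing EOS.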

@ -823,14 +823,11 @@ moveins(struct nfa * nfa,
/*
* copyins - copy in arcs of a state to another state
*
* Either all arcs, or only non-empty ones as determined by all value.
*/
static void
copyins(struct nfa * nfa,
struct state * oldState,
struct state * newState,
int all)
struct state * newState)
{
assert(oldState != newState);
@ -840,8 +837,7 @@ copyins(struct nfa * nfa,
struct arc *a;
for (a = oldState->ins; a != NULL; a = a->inchain)
if (all || a->type != EMPTY)
cparc(nfa, a, a->from, newState);
cparc(nfa, a, a->from, newState);
}
else
{
@ -873,12 +869,6 @@ copyins(struct nfa * nfa,
{
struct arc *a = oa;
if (!all && a->type == EMPTY)
{
oa = oa->inchain;
continue;
}
switch (sortins_cmp(&oa, &na))
{
case -1:
@ -904,12 +894,6 @@ copyins(struct nfa * nfa,
/* newState does not have anything matching oa */
struct arc *a = oa;
if (!all && a->type == EMPTY)
{
oa = oa->inchain;
continue;
}
oa = oa->inchain;
createarc(nfa, a->type, a->co, a->from, newState);
}
@ -1107,14 +1091,11 @@ moveouts(struct nfa * nfa,
/*
* copyouts - copy out arcs of a state to another state
*
* Either all arcs, or only non-empty ones as determined by all value.
*/
static void
copyouts(struct nfa * nfa,
struct state * oldState,
struct state * newState,
int all)
struct state * newState)
{
assert(oldState != newState);
@ -1124,8 +1105,7 @@ copyouts(struct nfa * nfa,
struct arc *a;
for (a = oldState->outs; a != NULL; a = a->outchain)
if (all || a->type != EMPTY)
cparc(nfa, a, newState, a->to);
cparc(nfa, a, newState, a->to);
}
else
{
@ -1157,12 +1137,6 @@ copyouts(struct nfa * nfa,
{
struct arc *a = oa;
if (!all && a->type == EMPTY)
{
oa = oa->outchain;
continue;
}
switch (sortouts_cmp(&oa, &na))
{
case -1:
@ -1188,12 +1162,6 @@ copyouts(struct nfa * nfa,
/* newState does not have anything matching oa */
struct arc *a = oa;
if (!all && a->type == EMPTY)
{
oa = oa->outchain;
continue;
}
oa = oa->outchain;
createarc(nfa, a->type, a->co, newState, a->to);
}
@ -1452,6 +1420,10 @@ optimize(struct nfa * nfa,
fprintf(f, "\nfinal cleanup:\n");
#endif
cleanup(nfa); /* final tidying */
#ifdef REG_DEBUG
if (verbose)
dumpnfa(nfa, f);
#endif
return analyze(nfa); /* and analysis */
}
@ -1568,7 +1540,7 @@ pull(struct nfa * nfa,
s = newstate(nfa);
if (NISERR())
return 0;
copyins(nfa, from, s, 1); /* duplicate inarcs */
copyins(nfa, from, s); /* duplicate inarcs */
cparc(nfa, con, s, to); /* move constraint arc */
freearc(nfa, con);
if (NISERR())
@ -1735,7 +1707,7 @@ push(struct nfa * nfa,
s = newstate(nfa);
if (NISERR())
return 0;
copyouts(nfa, to, s, 1); /* duplicate outarcs */
copyouts(nfa, to, s); /* duplicate outarcs */
cparc(nfa, con, from, s); /* move constraint arc */
freearc(nfa, con);
if (NISERR())
@ -2952,6 +2924,8 @@ dumpnfa(struct nfa * nfa,
{
#ifdef REG_DEBUG
struct state *s;
int nstates = 0;
int narcs = 0;
fprintf(f, "pre %d, post %d", nfa->pre->no, nfa->post->no);
if (nfa->bos[0] != COLORLESS)
@ -2964,7 +2938,12 @@ dumpnfa(struct nfa * nfa,
fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
fprintf(f, "\n");
for (s = nfa->states; s != NULL; s = s->next)
{
dumpstate(s, f);
nstates++;
narcs += s->nouts;
}
fprintf(f, "total of %d states, %d arcs\n", nstates, narcs);
if (nfa->parent == NULL)
dumpcolors(nfa->cm, f);
fflush(f);

@ -136,10 +136,10 @@ static int sortins_cmp(const void *, const void *);
static void sortouts(struct nfa *, struct state *);
static int sortouts_cmp(const void *, const void *);
static void moveins(struct nfa *, struct state *, struct state *);
static void copyins(struct nfa *, struct state *, struct state *, int);
static void copyins(struct nfa *, struct state *, struct state *);
static void mergeins(struct nfa *, struct state *, struct arc **, int);
static void moveouts(struct nfa *, struct state *, struct state *);
static void copyouts(struct nfa *, struct state *, struct state *, int);
static void copyouts(struct nfa *, struct state *, struct state *);
static void cloneouts(struct nfa *, struct state *, struct state *, struct state *, int);
static void delsub(struct nfa *, struct state *, struct state *);
static void deltraverse(struct nfa *, struct state *, struct state *);
@ -181,7 +181,6 @@ static void dumpnfa(struct nfa *, FILE *);
#ifdef REG_DEBUG
static void dumpstate(struct state *, FILE *);
static void dumparcs(struct state *, FILE *);
static int dumprarcs(struct arc *, struct state *, FILE *, int);
static void dumparc(struct arc *, struct state *, FILE *);
static void dumpcnfa(struct cnfa *, FILE *);
static void dumpcstate(int, struct cnfa *, FILE *);
@ -614,7 +613,9 @@ makesearch(struct vars * v,
for (s = slist; s != NULL; s = s2)
{
s2 = newstate(nfa);
copyouts(nfa, s, s2, 1);
NOERR();
copyouts(nfa, s, s2);
NOERR();
for (a = s->ins; a != NULL; a = b)
{
b = a->inchain;
@ -2014,7 +2015,7 @@ dump(regex_t *re,
dumpcolors(&g->cmap, f);
if (!NULLCNFA(g->search))
{
printf("\nsearch:\n");
fprintf(f, "\nsearch:\n");
dumpcnfa(&g->search, f);
}
for (i = 1; i < g->nlacons; i++)

@ -229,6 +229,41 @@ select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';
t
(1 row)
-- These cases used to give too-many-states failures
select 'x' ~ 'abcd(\m)+xyz';
?column?
----------
f
(1 row)
select 'a' ~ '^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
?column?
----------
f
(1 row)
select 'x' ~ 'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
?column?
----------
f
(1 row)
select 'x' ~ 'xyz(\Y\Y)+';
?column?
----------
f
(1 row)
select 'x' ~ 'x|(?:\M)+';
?column?
----------
t
(1 row)
-- This generates O(N) states but O(N^2) arcs, so it causes problems
-- if arc count is not constrained
select 'x' ~ repeat('x*y*z*', 1000);
ERROR: invalid regular expression: regular expression is too complex
-- Test backref in combination with non-greedy quantifier
-- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
select 'Programmer' ~ '(\w).*?\1' as t;

@ -55,6 +55,17 @@ select 'dd x' ~ '(^(?!aa)(?!bb)(?!cc))+';
select 'a' ~ '((((((a)*)*)*)*)*)*';
select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';
-- These cases used to give too-many-states failures
select 'x' ~ 'abcd(\m)+xyz';
select 'a' ~ '^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
select 'x' ~ 'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
select 'x' ~ 'xyz(\Y\Y)+';
select 'x' ~ 'x|(?:\M)+';
-- This generates O(N) states but O(N^2) arcs, so it causes problems
-- if arc count is not constrained
select 'x' ~ repeat('x*y*z*', 1000);
-- Test backref in combination with non-greedy quantifier
-- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
select 'Programmer' ~ '(\w).*?\1' as t;