Commit Graph

153 Commits

Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Tom Lane ed87e19807 Mop-up for commit 85feb77aa0.
Adjust commentary in regc_pg_locale.c to remove mention of the possibility
of not having <wctype.h> functions, since we no longer consider that.

Eliminate duplicate code in wparser_def.c by generalizing the p_iswhat
macro to take a parameter saying what to return for non-ASCII chars
in C locale.  (That's not really a consequence of the
USE_WIDE_UPPER_LOWER-ectomy, but I noticed it while doing that.)
2017-09-22 11:35:12 -04:00
Tom Lane 85feb77aa0 Assume wcstombs(), towlower(), and sibling functions are always present.
These functions are required by SUS v2, which is our minimum baseline
for Unix platforms, and are present on all interesting Windows versions
as well.  Even our oldest buildfarm members have them.  Thus, we were not
testing the "!USE_WIDE_UPPER_LOWER" code paths, which explains why the bug
fixed in commit e6023ee7f escaped detection.  Per discussion, there seems
to be no more real-world value in maintaining this option.  Hence, remove
the configure-time tests for wcstombs() and towlower(), remove the
USE_WIDE_UPPER_LOWER symbol, and remove all the !USE_WIDE_UPPER_LOWER code.
There's not actually all that much of the latter, but simplifying the #if
nests is a win in itself.

Discussion: https://postgr.es/m/20170921052928.GA188913@rfd.leadboat.com
2017-09-22 11:00:58 -04:00
Tom Lane c7b8998ebb Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.

Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code.  The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there.  BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs.  So the
net result is that in about half the cases, such comments are placed
one tab stop left of before.  This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:19:25 -04:00
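
To picture the placement rule this commit describes, here is an invented fragment (not taken from the commit); the spacing is the point, not the code:

    #include <stdio.h>

    int
    main(void)
    {
        int         a_fairly_long_variable_name = 0;    /* begins at the next
                                                         * 4-space tab stop
                                                         * after the code */
        printf("%d\n", a_fairly_long_variable_name);
        return 0;
    }
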
Tom Lane e3860ffa4d Initial pgindent run with pg_bsd_indent version 2.0.
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:

* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
  sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
  well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
  with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
  than the expected column 33.

On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list.  This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.

There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses.  I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 14:39:04 -04:00
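
A hedged sketch of a few of the items listed above, using invented code; again, only the spacing decisions matter here:

    #include <stdlib.h>

    /* Function-pointer declarations come out more readably, e.g.: */
    typedef int (*cmp_fn) (const void *, const void *);

    /* Spaces are no longer unexpectedly removed around casts and sizeof: */
    static char *
    make_buffer(int nitems)
    {
        return (char *) malloc(sizeof(int) * (size_t) nitems);
    }

    /* And "struct structname *varname" keeps its natural spacing: */
    struct node *head;
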
Tom Lane 6cfaffc0dd Fix regexport.c to behave sanely with lookaround constraints.
regexport.c thought it could just ignore LACON arcs, but the correct
behavior is to treat them as satisfiable while consuming zero input
(rather reminiscently of commit 9f1e642d5).  Otherwise, the emitted
simplified-NFA representation may contain no paths leading from initial
to final state, which unsurprisingly confuses pg_trgm, as seen in
bug #14623 from Jeff Janes.

Since regexport's output representation has no concept of an arc that
consumes zero input, recurse internally to find the next normal arc(s)
after any LACON transitions.  We'd be forced into changing that
representation if a LACON could be the last arc reaching the final
state, but fortunately the regex library never builds NFAs with such
a configuration, so there always is a next normal arc.

Back-patch to 9.3 where this logic was introduced.

Discussion: https://postgr.es/m/20170413180503.25948.94871@wrigleys.postgresql.org
2017-04-13 17:18:35 -04:00
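
The recursion is easy to picture with a toy NFA walker; all types and names below are invented stand-ins, not regexport.c's own:

    #include <stdio.h>

    #define MAXARCS 4

    struct arc
    {
        int         is_constraint;  /* 1 = LACON-like: satisfiable while
                                     * consuming no input */
        int         color;          /* input color, for normal arcs */
        int         to;             /* target state number */
    };

    struct state
    {
        int         narcs;
        struct arc  arcs[MAXARCS];
    };

    /* Report the normal arcs leaving "st", looking through any constraint
     * arcs to the normal arcs on their far side, as the commit describes. */
    static void
    emit_normal_arcs(const struct state *states, int st)
    {
        int         i;

        for (i = 0; i < states[st].narcs; i++)
        {
            const struct arc *a = &states[st].arcs[i];

            if (a->is_constraint)
                emit_normal_arcs(states, a->to);
            else
                printf("arc on color %d -> state %d\n", a->color, a->to);
        }
    }

    int
    main(void)
    {
        /* state 0 --LACON--> state 1 --color 42--> state 2 (final) */
        struct state nfa[] = {
            {1, {{1, 0, 1}}},
            {1, {{0, 42, 2}}},
            {0, {{0, 0, 0}}},
        };

        emit_normal_arcs(nfa, 0);
        return 0;
    }

This relies on the property noted above: since a LACON can never be the last arc reaching the final state, the recursion always finds a normal arc to report.
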
Peter Eisentraut eccfef81e1 ICU support
Add a column collprovider to pg_collation that determines which library
provides the collation data.  The existing choices are default and libc,
and this adds an icu choice, which uses the ICU4C library.

The pg_locale_t type is changed to a union that contains the
provider-specific locale handles.  Users of locale information are
changed to look into that struct for the appropriate handle to use.

Also add a collversion column that records the version of the collation
when it is created, and check at run time whether it is still the same.
This detects potentially incompatible library upgrades that can corrupt
indexes and other structures.  This is currently only supported by
ICU-provided collations.

initdb initializes the default collation set as before from the `locale
-a` output but also adds all available ICU locales with a "-x-icu"
appended.

Currently, ICU-provided collations can only be explicitly named
collations.  The global database locales are still always libc-provided.

ICU support is enabled by configure --with-icu.

Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
2017-03-23 15:28:48 -04:00
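
In outline, the changed type looks something like this; a simplified sketch, not the actual pg_locale.h definition (configure guards and extra fields omitted):

    #include <locale.h>         /* locale_t; POSIX 2008, may need _GNU_SOURCE */

    struct UCollator;           /* opaque ICU4C collator handle */

    typedef struct
    {
        char        provider;   /* which library owns this locale:
                                 * 'c' for libc, 'i' for ICU */
        union
        {
            locale_t    lt;             /* libc-provided collation handle */
            struct UCollator *ucol;     /* ICU-provided collation handle */
        }           info;
    } pg_locale_sketch;

Callers branch on the provider tag and pass the matching handle to either the libc locale functions or the ICU collation API.
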
Peter Eisentraut f97a028d8e Spelling fixes in code comments
From: Josh Soref <jsoref@gmail.com>
2017-03-14 12:58:39 -04:00
Tom Lane d6b059ec74 Document intentional violations of header inclusion policy.
Although there are good reasons for our policy of including postgres.h
as the first #include in every .c file, never from .h files, there are
two places where it seems expedient to violate the policy because the
alternative is to modify externally-supplied .c files.  (In the case
of the regexp library, the idea that it's externally-supplied is kind
of at odds with reality, but I haven't entirely given up hope that it
will become a standalone project some day.)  Add some comments to make
it explicit that this is a policy violation and provide the reasoning.

In passing, move #include "miscadmin.h" out of regcomp.c and into
regcustom.h, which is where it should be if we're taking this reasoning
seriously at all.

Discussion: https://postgr.es/m/CAEepm=2zCoeq3QxVwhS5DFeUh=yU6z81pbWMgfOB8OzyiBwxzw@mail.gmail.com
Discussion: https://postgr.es/m/11634.1488932128@sss.pgh.pa.us
2017-03-08 17:01:13 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017
2017-01-03 13:48:53 -05:00
Tom Lane c54159d44c Make locale-dependent regex character classes work for large char codes.
Previously, we failed to recognize Unicode characters above U+7FF as
being members of locale-dependent character classes such as [[:alpha:]].
(Actually, the same problem occurs for large pg_wchar values in any
multibyte encoding, but UTF8 is the only case people have actually
complained about.)  It's impractical to get Spencer's original code to
handle character classes or ranges containing many thousands of characters,
because it insists on considering each member character individually at
regex compile time, whether or not the character will ever be of interest
at run time.  To fix, choose a cutoff point MAX_SIMPLE_CHR below which
we process characters individually as before, and deal with entire ranges
or classes as single entities above that.  We can actually make things
cheaper than before for chars below the cutoff, because the color map can
now be a simple linear array for those chars, rather than the multilevel
tree structure Spencer designed.  It's more expensive than before for
chars above the cutoff, because we must do a binary search in a list of
high chars and char ranges used in the regex pattern, plus call iswalpha()
and friends for each locale-dependent character class used in the pattern.
However, multibyte encodings are normally designed to give smaller codes
to popular characters, so that we can expect that the slow path will be
taken relatively infrequently.  In any case, the speed penalty appears
minor except when we have to apply iswalpha() etc. to high character codes
at runtime --- and the previous coding gave wrong answers for those cases,
so whether it was faster is moot.

Tom Lane, reviewed by Heikki Linnakangas

Discussion: <15563.1471913698@sss.pgh.pa.us>
2016-09-05 17:06:29 -04:00
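
A toy version of the two-tier lookup conveys the idea; MAX_SIMPLE_CHR is the commit's real cutoff name and value, everything else here is invented:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_SIMPLE_CHR  0x7FF       /* cutoff chosen by the commit */

    typedef unsigned int chr;
    typedef short color;

    struct high_range
    {
        chr         from;
        chr         to;
        color       co;
    };

    static color simple_map[MAX_SIMPLE_CHR + 1];    /* flat array: O(1) */

    static int
    range_cmp(const void *key, const void *elem)
    {
        chr         c = *(const chr *) key;
        const struct high_range *r = elem;

        return (c < r->from) ? -1 : (c > r->to) ? 1 : 0;
    }

    static color
    lookup_color(chr c, const struct high_range *ranges, size_t nranges)
    {
        const struct high_range *hit;

        if (c <= MAX_SIMPLE_CHR)
            return simple_map[c];       /* the common fast path */
        hit = bsearch(&c, ranges, nranges, sizeof(*ranges), range_cmp);
        return hit ? hit->co : 0;       /* 0 standing in for "no color" */
    }

    int
    main(void)
    {
        static const struct high_range ranges[] = {
            {0x1F600, 0x1F64F, 7},      /* a high block mapped to one color */
        };

        simple_map['a'] = 3;
        printf("%d %d\n", lookup_color('a', ranges, 1),
               lookup_color(0x1F604, ranges, 1));
        return 0;
    }
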
Tom Lane a859e64003 Clean up another pre-ANSI-C-ism in regex code: get rid of pcolor typedef.
pcolor was used to represent function arguments that are nominally of
type color, but when using a pre-ANSI C compiler would be passed as the
promoted integer type.  We really don't need that anymore.
2016-08-19 13:31:10 -04:00
Tom Lane 6eefd2422e Remove typedef celt from the regex library, along with macro NOCELT.
The regex library used to have a notion of a "collating element" that was
distinct from a "character", but Henry Spencer never actually implemented
his planned support for multi-character collating elements, and the Tcl
crew ripped out most of the stubs for that years ago.  The only thing left
that distinguished the "celt" typedef from the "chr" typedef was that
"celt" was supposed to also be able to hold the not-a-character "NOCELT"
value.  However, NOCELT was not used anywhere after the MCCE stub removal
changes, which means there's no need for celt to be different from chr.
Removing the separate typedef simplifies matters and also removes a trap
for the unwary, in that celt is signed while chr may not be, so comparisons
could mean different things.  There's no bug there today because we
restrict CHR_MAX to be less than INT_MAX, but I think there may have been
such bugs before we did that, and there could be again if anyone ever
decides to fool with the range of chr.

This patch also removes assorted unnecessary casts to "chr" of values
that are already chrs.  Many of these seem to be leftover from days when
the code was compatible with pre-ANSI C.
2016-08-19 12:51:02 -04:00
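
The signed-versus-unsigned trap the commit mentions is easy to demonstrate in isolation; this is a standalone example, not the library's code:

    #include <stdio.h>

    typedef int celt;           /* old: signed, so it could also hold a
                                 * not-a-character sentinel value */
    typedef unsigned int chr;   /* chr is allowed to be unsigned */

    int
    main(void)
    {
        celt        c = -1;
        chr         limit = 100;

        /* The usual arithmetic conversions turn c into a huge unsigned
         * value here, so the test is false even though -1 < 100. */
        if (c < limit)
            printf("-1 is below the limit\n");
        else
            printf("surprise: -1 compares as above the limit\n");
        return 0;
    }
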
Tom Lane 8c95ae81fa Suppress compiler warnings about useless comparison of unsigned to zero.
Reportedly, some compilers warn about tests like "c < 0" if c is unsigned,
and hence complain about the character range checks I added in commit
3bb3f42f37.  This is a bit of a pain since
the regex library doesn't really want to assume that chr is unsigned.
However, since any such reconfiguration would involve manual edits of
regcustom.h anyway, we can put it on the shoulders of whoever wants to
do that to adjust this new range-checking macro correctly.

Per gripes from Coverity and Andres.
2016-02-15 17:12:16 -05:00
Tom Lane 3bb3f42f37 Fix some regex issues with out-of-range characters and large char ranges.
Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a
bad choice because it is outside the range of type "celt" (int32).
Characters approaching that limit could lead to infinite loops in logic
such as "for (c = a; c <= b; c++)" where c is of type celt but the
range bounds are chr.  Such loops will work safely only if CHR_MAX+1
is representable in celt, since c must advance to beyond b before the
loop will exit.

Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe.
It's highly unlikely that Unicode will ever assign codes that high, and
none of our other backend encodings need characters beyond that either.

In addition to modifying the macro, we have to explicitly enforce character
range restrictions on the values of \u, \U, and \x escape sequences, else
the limit is trivially bypassed.

Also, the code for expanding case-independent character ranges in bracket
expressions had a potential integer overflow in its calculation of the
number of characters it could generate, which could lead to allocating too
small a character vector and then overwriting memory.  An attacker with the
ability to supply arbitrary regex patterns could easily cause transient DoS
via server crashes, and the possibility for privilege escalation has not
been ruled out.

Quite aside from the integer-overflow problem, the range expansion code was
unnecessarily inefficient in that it always produced a result consisting of
individual characters, abandoning the knowledge that we had a range to
start with.  If the input range is large, this requires excessive memory.
Change it so that the original range is reported as-is, and then we add on
any case-equivalent characters that are outside that range.  With this
approach, we can bound the number of individual characters allowed without
sacrificing much.  This patch allows at most 100000 individual characters,
which I believe to be more than the number of case pairs existing in
Unicode, so that the restriction will never be hit in practice.

It's still possible for range() to take a while given a large character code
range, so also add statement-cancel detection to its loop.  The downstream
function dovec() also lacked cancel detection, and could take a long time
given a large output from range().

Per fuzz testing by Greg Stark.  Back-patch to all supported branches.

Security: CVE-2016-0773
2016-02-08 10:25:40 -05:00
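
Why CHR_MAX+1 must be representable is clear from a standalone rendering of the loop shape quoted above (values illustrative):

    #include <stdio.h>
    #include <limits.h>

    int
    main(void)
    {
        /* With the new CHR_MAX = 0x7ffffffe, the loop variable can step
         * to b + 1 = INT_MAX without overflow, so "c <= b" can go false.
         * Were b allowed to reach INT_MAX itself, the test would be
         * always-true and c++ would overflow (undefined behavior). */
        int         b = 0x7ffffffe;
        int         c;
        long        count = 0;

        for (c = b - 2; c <= b; c++)
            count++;
        printf("iterated %ld times, stopped with c = %d\n", count, c);
        return 0;
    }
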
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Tom Lane a43b4ab111 Fix enforcement of restrictions inside regexp lookaround constraints.
Lookahead and lookbehind constraints aren't allowed to contain backrefs,
and parentheses within them are always considered non-capturing.  Or so
says the manual.  But the regexp parser forgot about these rules once
inside a parenthesized subexpression, so that constructs like (\w)(?=(\1))
were accepted (but then not correctly executed --- a case like this acted
like (\w)(?=\w), without any enforcement that the two \w's match the same
text).  And in (?=((foo))) the innermost parentheses would be counted as
capturing parentheses, though no text would ever be captured for them.

To fix, properly pass down the "type" argument to the recursive invocation
of parse().

Back-patch to all supported branches; it was agreed that silent
misexecution of such patterns is worse than throwing an error, even though
new errors in minor releases are generally not desirable.
2015-11-07 12:43:24 -05:00
Tom Lane 12c9a04008 Implement lookbehind constraints in our regular-expression engine.
A lookbehind constraint is like a lookahead constraint in that it consumes
no text; but it checks for existence (or nonexistence) of a match *ending*
at the current point in the string, rather than one *starting* at the
current point.  This is a long-requested feature since it exists in many
other regex libraries, but Henry Spencer had never got around to
implementing it in the code we use.

Just making it work is actually pretty trivial; but naive copying of the
logic for lookahead constraints leads to code that often spends O(N^2) time
to scan an N-character string, because we have to run the match engine
from string start to the current probe point each time the constraint is
checked.  In typical use-cases a lookbehind constraint will be written at
the start of the regex and hence will need to be checked at every character
--- so O(N^2) work overall.  To fix that, I introduced a third copy of the
core DFA matching loop, paralleling the existing longest() and shortest()
loops.  This version, matchuntil(), can suspend and resume matching given
a couple of pointers' worth of storage space.  So we need only run it
across the string once, stopping at each interesting probe point and then
resuming to advance to the next one.

I also put in an optimization that simplifies one-character lookahead and
lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
constraints, which already existed in the engine.  This avoids the overhead
of the LACON machinery entirely for these rather common cases.

The net result is that lookbehind constraints run a factor of three or so
slower than Perl's for multi-character constraints, but faster than Perl's
for one-character constraints ... and they work fine for variable-length
constraints, which Perl gives up on entirely.  So that's not bad from a
competitive perspective, and there's room for further optimization if
anyone cares.  (In reality, raw scan rate across a large input string is
probably not that big a deal for Postgres usage anyway; so I'm happy if
it's linear.)
2015-10-30 19:14:19 -04:00
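
The suspend-and-resume idea can be shown with a toy incremental matcher for one fixed lookbehind; nothing here is the real matchuntil(), which works on arbitrary DFAs:

    #include <stdio.h>

    /* Tiny DFA for "the text so far ends with 'ab'": state 0 = no progress,
     * 1 = just saw 'a', 2 = just saw "ab". */
    static int
    step(int state, char c)
    {
        if (c == 'a')
            return 1;
        if (c == 'b' && state == 1)
            return 2;
        return 0;
    }

    int
    main(void)
    {
        const char *text = "xxabyab";
        int         state = 0;  /* one int is all the "suspended match"
                                 * storage this toy needs */
        int         i;

        /* Advance once per character instead of rescanning from the start
         * of the string at every probe point: O(N) overall, not O(N^2). */
        for (i = 0; text[i] != '\0'; i++)
        {
            state = step(state, text[i]);
            if (state == 2)
                printf("lookbehind (?<=ab) holds at offset %d\n", i + 1);
        }
        return 0;
    }
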
Tom Lane 9f1e642d50 Fix incorrect handling of lookahead constraints in pg_regprefix().
pg_regprefix was doing nothing with lookahead constraints, which would
be fine if it were the right kind of nothing, but it isn't: we have to
terminate our search for a fixed prefix, not just pretend the LACON arc
isn't there.  Otherwise, if the current state has both a LACON outarc and a
single plain-color outarc, we'd falsely conclude that the color represents
an addition to the fixed prefix, and generate an extracted index condition
that restricts the indexscan too much.  (See added regression test case.)

Terminating the search is conservative: we could traverse the LACON arc
(thus assuming that the constraint can be satisfied at runtime) and then
examine the outarcs of the linked-to state.  But that would be a lot more
work than it seems worth, because writing a LACON followed by a single
plain character is a pretty silly thing to do.

This makes a difference only in rather contrived cases, but it's a bug,
so back-patch to all supported branches.
2015-10-19 13:54:53 -07:00
Tom Lane afdfcd3f76 Miscellaneous cleanup of regular-expression compiler.
Revert our previous addition of "all" flags to copyins() and copyouts();
they're no longer needed, and were never anything but an unsightly hack.

Improve a couple of infelicities in the REG_DEBUG code for dumping
the NFA data structure, including adding code to count the total
number of states and arcs.

Add a couple of missed error checks.

Add some more documentation in the README file, and some regression tests
illustrating cases that exceeded the state-count limit and/or took
unreasonable amounts of time before this set of patches.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00
Tom Lane 538b3b8b35 Improve memory-usage accounting in regular-expression compiler.
This code previously counted the number of NFA states it created, and
complained if a limit was exceeded, so as to prevent bizarre regex patterns
from consuming unreasonable time or memory.  That's fine as far as it went,
but the code paid no attention to how many arcs linked those states.  Since
regexes can be contrived that have O(N) states but will need O(N^2) arcs
after fixempties() processing, it was still possible to blow out memory,
and take a long time doing it too.  To fix, modify the bookkeeping to count
space used by both states and arcs.

I did not bother with including the "color map" in the accounting; it
can only grow to a few megabytes, which is not a lot in comparison to
what we're allowing for states+arcs (about 150MB on 64-bit machines
or half that on 32-bit machines).

Looking at some of the larger real-world regexes captured in the Tcl
regression test suite suggests that the most that is likely to be needed
for regexes found in the wild is under 10MB, so I believe that the current
limit has enough headroom to make it okay to keep it as a hard-wired limit.

In connection with this, redefine REG_ETOOBIG as meaning "regular
expression is too complex"; the previous wording of "nfa has too many
states" was already somewhat inapropos because of the error code's use
for stack depth overrun, and it was not very user-friendly either.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00
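
In outline the bookkeeping is just a shared budget that both allocation paths draw from; the names below are invented, and the commit wires the equivalent logic into the NFA allocators:

    #include <stddef.h>

    #define MAX_COMPILE_SPACE   (150u * 1024 * 1024)    /* ~150MB, per the commit */

    struct budget
    {
        size_t      used;       /* bytes charged for states plus arcs */
    };

    /* Charge an allocation against the budget; on failure the caller
     * reports REG_ETOOBIG ("regular expression is too complex"). */
    static int
    charge(struct budget *b, size_t nbytes)
    {
        if (nbytes > MAX_COMPILE_SPACE - b->used)
            return 0;
        b->used += nbytes;
        return 1;
    }
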
Tom Lane 6a7153661d Improve performance of pullback/pushfwd in regular-expression compiler.
The previous coding would create a new intermediate state every time it
wanted to interchange the ordering of two constraint arcs.  Certain regex
features such as \Y can generate large numbers of parallel constraint arcs,
and if we needed to reorder the results of that, we created unreasonable
numbers of intermediate states.  To improve matters, keep a list of
already-created intermediate states associated with the state currently
being considered by the outer loop; we can re-use such states to place all
the new arcs leading to the same destination or source.

I also took the trouble to redefine push() and pull() to have a less risky
API: they no longer delete any state or arc that the caller might possibly
have a pointer to, except for the specifically-passed constraint arc.
This reduces the risk of re-introducing the same type of error seen in
the failed patch for CVE-2007-4772.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00
Tom Lane f5b7d103bc Improve performance of fixempties() pass in regular-expression compiler.
The previous coding took something like O(N^4) time to fully process a
chain of N EMPTY arcs.  We can't really do much better than O(N^2) because
we have to insert about that many arcs, but we can do lots better than
what's there now.  The win comes partly from using mergeins() to amortize
de-duplication of arcs across multiple source states, and partly from
exploiting knowledge of the ordering of arcs for each state to avoid
looking at arcs we don't need to consider during the scan.  We do have
to be a bit careful of the possible reordering of arcs introduced by
the sort-merge coding of the previous commit, but that's not hard to
deal with.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00
Tom Lane 579840ca05 Fix O(N^2) performance problems in regular-expression compiler.
Change the singly-linked in-arc and out-arc lists to be doubly-linked,
so that arc deletion is constant time rather than having worst-case time
proportional to the number of other arcs on the connected states.

Modify the bulk arc transfer operations copyins(), copyouts(), moveins(),
moveouts() so that they use a sort-and-merge algorithm whenever there's
more than a small number of arcs to be copied or moved.  The previous
method is O(N^2) in the number of arcs involved, because it performs
duplicate checking independently for each copied arc.  The new method may
change the ordering of existing arcs for the destination state, but nothing
really cares about that.

Provide another bulk arc copying method mergeins(), which is unused as
of this commit but is needed for the next one.  It basically is like
copyins(), but the source arcs might not all come from the same state.

Replace the O(N^2) bubble-sort algorithm used in carcsort() with a qsort()
call.

These changes greatly improve the performance of regex compilation for
large or complex regexes, at the cost of extra space for arc storage during
compilation.  The original tradeoff was probably fine when it was made, but
now we care more about speed and less about memory consumption.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00
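
The heart of the first change is ordinary doubly-linked-list surgery: deletion touches only the arc's own neighbors, whereas a singly-linked list forces a scan over the state's other arcs to find the predecessor. A generic sketch, not the library's actual arc representation:

    #include <stddef.h>

    struct arc
    {
        int         color;
        struct arc *prev;       /* the new back-link that makes deletion O(1) */
        struct arc *next;
    };

    static void
    unlink_arc(struct arc **head, struct arc *a)
    {
        if (a->prev != NULL)
            a->prev->next = a->next;
        else
            *head = a->next;    /* a was the first arc on the list */
        if (a->next != NULL)
            a->next->prev = a->prev;
        /* no walk over the other arcs, hence constant time */
    }
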
Tom Lane 48789c5d23 Fix regular-expression compiler to handle loops of constraint arcs.
It's possible to construct regular expressions that contain loops of
constraint arcs (that is, ^ $ AHEAD BEHIND or LACON arcs).  There's no use
in fully traversing such a loop at execution, since you'd just end up in
the same NFA state without having consumed any input.  Worse, such a loop
leads to infinite looping in the pullback/pushfwd stage of compilation,
because we keep pushing or pulling the same constraints around the loop
in a vain attempt to move them to the pre or post state.  Such looping was
previously recognized in CVE-2007-4772; but the fix only handled the case
of trivial single-state loops (that is, a constraint arc leading back to
its source state) ... and not only that, it was incorrect even for that
case, because it broke the admittedly-not-very-clearly-stated API contract
of the pull() and push() subroutines.  The first two regression test cases
added by this commit exhibit patterns that result in assertion failures
because of that (though there seem to be no ill effects in non-assert
builds).  The other new test cases exhibit multi-state constraint loops;
in an unpatched build they will run until the NFA state-count limit is
exceeded.

To fix, remove the code added for CVE-2007-4772, and instead create a
general-purpose constraint-loop-breaking phase of regex compilation that
executes before we do pullback/pushfwd.  Since we never need to traverse
a constraint loop fully, we can just break the loop at any chosen spot,
if we add clone states that can replicate any sequence of arc transitions
that would've traversed just part of the loop.

Also add some commentary clarifying why we have to have all these
machinations in the first place.

This class of problems has been known for some time --- we had a report
from Marc Mamin about two years ago, for example, and there are related
complaints in the Tcl bug tracker.  I had discussed a fix of this kind
off-list with Henry Spencer, but didn't get around to doing something
about it until the issue was rediscovered by Greg Stark recently.

Back-patch to all supported branches.
2015-10-16 15:55:58 -04:00
Tom Lane b63fc28776 Add recursion depth protections to regular expression matching.
Some of the functions in regex compilation and execution recurse, and
therefore could in principle be driven to stack overflow.  The Tcl crew
has seen this happen in practice in duptraverse(), though their fix was
to put in a hard-wired limit on the number of recursive levels, which is
not too appetizing --- fortunately, we have enough infrastructure to check
the actually available stack.  Greg Stark has also seen it in other places
while fuzz testing on a machine with limited stack space.  Let's put guards
in to prevent crashes in all these places.

Since the regex code would leak memory if we simply threw elog(ERROR),
we have to introduce an API that checks for stack depth without throwing
such an error.  Fortunately that's not difficult.
2015-10-02 14:51:58 -04:00
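
The shape of the non-throwing check is simple. In this sketch, stack_is_too_deep() is the real PostgreSQL probe introduced for this purpose, but the surrounding function and the error value are stand-ins:

    #include <stdbool.h>

    extern bool stack_is_too_deep(void);    /* tests the limit, never throws */

    #define REG_ETOOBIG 1       /* illustrative value; the real code lives
                                 * in regex.h */

    static int
    recursive_pass(int depth)
    {
        if (stack_is_too_deep())
            return REG_ETOOBIG; /* an error return lets the regex code free
                                 * its memory, unlike a thrown elog(ERROR) */
        if (depth > 0)
            return recursive_pass(depth - 1);
        return 0;
    }
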
Tom Lane f2c4ffc330 Fix potential infinite loop in regular expression execution.
In cfindloop(), if the initial call to shortest() reports that a
zero-length match is possible at the current search start point, but then
it is unable to construct any actual match to that, it'll just loop around
with the same start point, and thus make no progress.  We need to force the
start point to be advanced.  This is safe because the loop over "begin"
points has already tried and failed to match starting at "close", so there
is surely no need to try that again.

This bug was introduced in commit e2bd904955,
wherein we allowed continued searching after we'd run out of match
possibilities, but evidently failed to think hard enough about exactly
where we needed to search next.

Because of the way this code works, such a match failure is only possible
in the presence of backrefs --- otherwise, shortest()'s judgment that a
match is possible should always be correct.  That probably explains how
come the bug has escaped detection for several years.

The actual fix is a one-liner, but I took the trouble to add/improve some
comments related to the loop logic.

After fixing that, the submitted test case "()*\1" didn't loop anymore.
But it reported failure, though it seems like it ought to match a
zero-length string; both Tcl and Perl think it does.  That seems to be from
overenthusiastic optimization on my part when I rewrote the iteration match
logic in commit 173e29aa5deefd9e71c183583ba37805c8102a72: we can't just
"declare victory" for a zero-length match without bothering to set match
data for capturing parens inside the iterator node.

Per fuzz testing by Greg Stark.  The first part of this is a bug in all
supported branches, and the second part is a bug since 9.2 where the
iteration rewrite happened.
2015-10-02 14:26:36 -04:00
Tom Lane 9fe8fe9c9e Add some more query-cancel checks to regular expression matching.
Commit 9662143f0c added infrastructure to
allow regular-expression operations to be terminated early in the event
of SIGINT etc.  However, fuzz testing by Greg Stark disclosed that there
are still cases where regex compilation could run for a long time without
noticing a cancel request.  Specifically, the fixempties() phase never
adds new states, only new arcs, so it doesn't hit the cancel check I'd put
in newstate().  Add one to newarc() as well to cover that.

Some experimentation of my own found that regex execution could also run
for a long time despite a pending cancel.  We'd put a high-level cancel
check into cdissect(), but there was none inside the core text-matching
routines longest() and shortest().  Ordinarily those inner loops are very
very fast ... but in the presence of lookahead constraints, not so much.
As a compromise, stick a cancel check into the stateset cache-miss
function, which is enough to guarantee a cancel check at least once per
lookahead constraint test.

Making this work required more attention to error handling throughout the
regex executor.  Henry Spencer had apparently originally intended longest()
and shortest() to be incapable of incurring errors while running, so
neither they nor their subroutines had well-defined error reporting
behaviors.  However, that was already broken by the lookahead constraint
feature, since lacon() can surely suffer an out-of-memory failure ---
which, in the code as it stood, might never be reported to the user at all,
but just silently be treated as a non-match of the lookahead constraint.
Normalize all that by inserting explicit error tests as needed.  I took the
opportunity to add some more comments to the code, too.

Back-patch to all supported branches, like the previous patch.
2015-10-02 13:45:39 -04:00
Tom Lane d9c0c728af Fix low-probability memory leak in regex execution.
After an internal failure in shortest() or longest() while pinning down the
exact location of a match, find() forgot to free the DFA structure before
returning.  This is pretty unlikely to occur, since we just successfully
ran the "search" variant of the DFA; but it could happen, and it would
result in a session-lifespan memory leak since this code uses malloc()
directly.  Problem seems to have been aboriginal in Spencer's library,
so back-patch all the way.

In passing, correct a thinko in a comment I added a while back about the
meaning of the "ntree" field.

I happened across these issues while comparing our code to Tcl's version
of the library.
2015-09-18 13:55:17 -04:00
Tom Lane b44d92b67b Sync regex code with Tcl 8.6.4.
Sync our regex code with upstream changes since last time we did this,
which was Tcl 8.5.11 (see commit 08fd6ff37f).

The only functional change here is to disbelieve that an octal escape is
three digits long if it would exceed \377.  That's a bug fix, but it's
a minor one and could change the interpretation of working regexes, so
don't back-patch.

In addition to that, s/INFINITY/DUPINF/ to eliminate the risk of collisions
with <math.h>'s macro, and s/LOCAL/NOPROP/ because that also seems like
an unnecessarily collision-prone macro name.

There were some other cosmetic changes in their copy that I did not adopt,
notably a rather half-hearted attempt at renaming some of the C functions
in a more verbose style.  (I'm not necessarily against the concept, but
renaming just a few functions in the package is not an improvement.)
2015-09-16 15:25:25 -04:00
Tom Lane 91cf3135b9 Fix minor bug in regexp makesearch() function.
The list-wrangling here was done wrong, allowing the same state to get
put into the list twice.  The following loop then would clone it twice.
The second clone would wind up with no inarcs, so that there was no
observable misbehavior AFAICT, but a useless state in the finished NFA
isn't an especially good thing.
2015-09-09 20:14:58 -04:00
Tom Lane 8a0258c318 Fix some possible low-memory failures in regexp compilation.
newnfa() failed to set the regex error state when malloc() fails.
Several places in regcomp.c failed to check for an error after calling
subre().  Each of these mistakes could lead to null-pointer-dereference
crashes in memory-starved backends.

Report and patch by Andreas Seltenreich.  Back-patch to all branches.
2015-08-12 00:48:11 -04:00
Tom Lane 586dd5d6a5 Replace a bunch more uses of strncpy() with safer coding.
strncpy() has a well-deserved reputation for being unsafe, so make an
effort to get rid of nearly all occurrences in HEAD.

A large fraction of the remaining uses were passing length less than or
equal to the known strlen() of the source, in which case no null-padding
can occur and the behavior is equivalent to memcpy(), though doubtless
slower and certainly harder to reason about.  So just use memcpy() in
these cases.

In other cases, use either StrNCpy() or strlcpy() as appropriate (depending
on whether padding to the full length of the destination buffer seems
useful).

I left a few strncpy() calls alone in the src/timezone/ code, to keep it
in sync with upstream (the IANA tzcode distribution).  There are also a
few such calls in ecpg that could possibly do with more analysis.

AFAICT, none of these changes are more than cosmetic, except for the four
occurrences in fe-secure-openssl.c, which are in fact buggy: an overlength
source leads to a non-null-terminated destination buffer and ensuing
misbehavior.  These don't seem like security issues, first because no stack
clobber is possible and second because if your values of sslcert etc are
coming from untrusted sources then you've got problems way worse than this.
Still, it's undesirable to have unpredictable behavior for overlength
inputs, so back-patch those four changes to all active branches.
2015-01-24 13:05:42 -05:00
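
The two replacement patterns, in standalone form (buffer names invented):

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *src = "hello";
        char        exact[4];
        char        padded[16];

        /* Case 1: copy length <= strlen(src).  strncpy neither pads nor
         * terminates here, so memcpy states the intent honestly; the
         * terminator is added explicitly. */
        memcpy(exact, src, sizeof(exact) - 1);
        exact[sizeof(exact) - 1] = '\0';

        /* Case 2: truncation with guaranteed termination.  strlcpy is the
         * commit's choice where available; snprintf is a portable stand-in. */
        snprintf(padded, sizeof(padded), "%s", src);

        printf("%s / %s\n", exact, padded);
        return 0;
    }
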
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Tom Lane 3694b4d7e1 Fix incorrect search for "x?" style matches in creviterdissect().
When the number of allowed iterations is limited (either a "?" quantifier
or a bound expression), the last sub-match has to reach to the end of the
target string.  The previous coding here first tried the shortest possible
match (one character, usually) and then gave up and back-tracked if that
didn't work, typically leading to failure to match overall, as shown in
bug #11478 from Christoph Berg.  The minimum change to fix that would be to
not decrement k before "goto backtrack"; but that would be a pretty stupid
solution, because we'd laboriously try each possible sub-match length
before finally discovering that only ending at the end can work.  Instead,
force the sub-match endpoint limit up to the end for even the first
shortest() call if we cannot have any more sub-matches after this one.

Bug introduced in my rewrite that added the iterdissect logic, commit
173e29aa5d.  The shortest-first search code
was too closely modeled on the longest-first code, which hasn't got this
issue since it tries a match reaching to the end to start with anyway.
Back-patch to all affected branches.
2014-09-23 20:26:14 -04:00
Tom Lane 1567e659a8 Fix two low-probability memory leaks in regular expression parsing.
If pg_regcomp failed after having invoked markst/cleanst, it would leak any
"struct subre" nodes it had created.  (We've already detected all regex
syntax errors at that point, so the only likely causes of later failure
would be query cancel or out-of-memory.)  To fix, make sure freesrnode
knows the difference between the pre-cleanst and post-cleanst cleanup
procedures.  Add some documentation of this less-than-obvious point.

Also, newlacon did the wrong thing with an out-of-memory failure from
realloc(), so that the previously allocated array would be leaked.

Both of these are pretty low-probability scenarios, but a bug is a bug,
so patch all the way back.

Per bug #10976 from Arthur O'Dwyer.
2014-07-18 13:00:27 -04:00
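
The newlacon() half of the bug is the classic realloc leak; the safe idiom keeps the old pointer until the new one is known good. A generic example, not the patched function itself:

    #include <stdlib.h>

    static int
    grow_array(int **items, size_t newcount)
    {
        int        *tmp = realloc(*items, newcount * sizeof(int));

        if (tmp == NULL)
            return 0;           /* *items is still valid and can be freed
                                 * by the caller: nothing leaked */
        *items = tmp;
        return 1;
    }
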
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not affect backpatching.
2014-05-06 12:12:18 -04:00
Tom Lane ea8c7e9054 Fix memory leak during regular expression execution.
For a regex containing backrefs, pg_regexec() might fail to free all the
sub-DFAs that were created during execution, resulting in a permanent
(session lifespan) memory leak.  Problem was introduced by me in commit
587359479a.  Per report from Sandro Santilli;
diagnosis by Greg Stark.
2014-03-19 11:09:24 -04:00
Tom Lane 9662143f0c Allow regex operations to be terminated early by query cancel requests.
The regex code didn't have any provision for query cancel; which is
unsurprising given its non-Postgres origin, but still problematic since
some operations can take a long time.  Introduce a callback function to
check for a pending query cancel or session termination request, and
call it in a couple of strategic spots where we can make the regex code
exit with an error indicator.

If we ever actually split out the regex code as a standalone library,
some additional work will be needed to let the cancel callback function
be specified externally to the library.  But that's straightforward
(certainly so by comparison to putting the locale-dependent character
classification logic on a similar arms-length basis), and there seems
no need to do it right now.

A bigger issue is that there may be more places than these two where
we need to check for cancels.  We can always add more checks later,
now that the infrastructure is in place.

Since there are known examples of not-terribly-long regexes that can
lock up a backend for a long time, back-patch to all supported branches.
I have hopes of fixing the known performance problems later, but adding
query cancel ability seems like a good idea even if they were all fixed.
2014-03-01 15:20:56 -05:00
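
The callback mechanism reduces to a function pointer the library polls at strategic points. A sketch with invented names; the real hook is installed by the backend code that wraps the library:

    #include <stdbool.h>

    typedef bool (*cancel_check_fn) (void);

    static cancel_check_fn cancel_requested;   /* set by the embedding backend */

    /* Poll at a strategic spot; returning an error indicator lets the
     * regex code unwind and free its memory before the query aborts. */
    static int
    compile_step(void)
    {
        if (cancel_requested != NULL && cancel_requested())
            return -1;          /* maps to a REG_CANCEL-style error */
        /* ... do one unit of compilation work ... */
        return 0;
    }
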
Tom Lane 0d79c0a8cc Make various variables const (read-only).
These changes should generally improve correctness/maintainability.
A nice side benefit is that several kilobytes move from initialized
data to text segment, allowing them to be shared across processes and
probably reducing copy-on-write overhead while forking a new backend.
Unfortunately this doesn't seem to help libpq in the same way (at least
not when it's compiled with -fpic on x86_64), but we can hope the linker
at least collects all nominally-const data together even if it's not
actually part of the text segment.

Also, make pg_encname_tbl[] static in encnames.c, since there seems
no very good reason for any other code to use it; per a suggestion
from Wim Lewis, who independently submitted a patch that was mostly
a subset of this one.

Oskari Saarenmaa, with some editorialization by me
2014-01-18 16:04:32 -05:00
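
The change is mostly mechanical: fully const-qualifying static tables lets the compiler place them in the read-only text segment. An invented table of the same general shape as pg_encname_tbl:

    /* Both const qualifiers matter: the strings and the array of pointers
     * are read-only, so the whole table can live in the text segment and
     * be shared across forked backends. */
    static const char *const encoding_names[] = {
        "SQL_ASCII",
        "UTF8",
        "LATIN1",
    };

    const char *
    encoding_name(int i)
    {
        return encoding_names[i];
    }
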
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Tom Lane e2bd904955 Fix regex match failures for backrefs combined with non-greedy quantifiers.
An ancient logic error in cfindloop() could cause the regex engine to fail
to find matches that begin later than the start of the string.  This
function is only used when the regex pattern contains a back reference,
and so far as we can tell the error is only reachable if the pattern is
non-greedy (i.e. its first quantifier uses the ? modifier).  Furthermore,
the actual match must begin after some potential match that satisfies the
DFA but then fails the back-reference's match test.

Reported and fixed by Jeevan Chalke, with cosmetic adjustments by me.
2013-07-18 21:22:37 -04:00
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Tom Lane 3ccae48f44 Support indexing of regular-expression searches in contrib/pg_trgm.
This works by extracting trigrams from the given regular expression,
in generally the same spirit as the previously-existing support for
LIKE searches, though of course the details are far more complicated.

Currently, only GIN indexes are supported.  We might be able to make
it work with GiST indexes later.

The implementation includes adding API functions to backend/regex/
to provide a view of the search NFA created from a regular expression.
These functions are meant to be generic enough to be supportable in
a standalone version of the regex library, should that ever happen.

Alexander Korotkov, reviewed by Heikki Linnakangas and Tom Lane
2013-04-09 01:06:54 -04:00
Heikki Linnakangas bf2b0a1478 Fix crash on compiling a regular expression with more than 32k colors.
Throw an error instead.

Backpatch to all supported branches.
2013-04-04 19:48:11 +03:00
Tom Lane a7b61d4f5a Fix infinite-loop risk in fixempties() stage of regex compilation.
The previous coding of this function could get into situations where it
would never terminate, because successive passes would re-add EMPTY arcs
that had been removed by the previous pass.  Rewrite the function
completely using a new algorithm that is guaranteed to terminate, and
also seems to be usually faster than the old one.  Per Tcl bugs 3604074
and 3606683.

Tom Lane and Don Porter
2013-03-07 11:51:03 -05:00
Tom Lane 73dc003bee Add missing error check in regexp parser.
parseqatom() failed to check for an error return (NULL result) from its
recursive call to parsebranch(), and in consequence could crash with a
null-pointer dereference after an error return.  This bug has been there
since day one, but wasn't noticed before, probably because most error cases
in parsebranch() didn't actually lead to returning NULL.  Add the missing
error check, and also tweak parsebranch() to exit in a less indirect
fashion after a call to parseqatom() fails.

Report by Tomasz Karlik, fix by me.
2013-02-27 10:40:03 -05:00
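
The fix reduces to the standard propagate-don't-dereference pattern; these are stand-in types, the real functions being parseqatom() and parsebranch() in regcomp.c:

    struct subre_stub;          /* stands in for the parser's subre tree */

    extern struct subre_stub *parse_branch(void);

    static struct subre_stub *
    parse_qatom(void)
    {
        struct subre_stub *branch = parse_branch();

        if (branch == NULL)
            return NULL;        /* propagate the error instead of
                                 * dereferencing a null pointer */
        /* ... attach the branch to the atom being built ... */
        return branch;
    }
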
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Tom Lane 54fd196ffc Prevent corner-case core dump in rfree().
rfree() failed to cope with the case that pg_regcomp() had initialized the
regex_t struct but then failed to allocate any memory for re->re_guts (ie,
the first malloc call in pg_regcomp() failed).  It would try to touch the
guts struct anyway, and thus dump core.  This is a sufficiently narrow
corner case that it's not surprising it's never been seen in the field;
but still a bug is a bug, so patch all active branches.

Noted while investigating whether we need to call pg_regfree after a
failure return from pg_regcomp.  Other than this bug, it turns out we
don't, so adjust comments appropriately.
2012-07-15 13:27:54 -04:00
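
The defensive shape of the fix, with invented types; the real struct is regex_t and its re_guts field:

    #include <stddef.h>

    struct guts_stub;

    struct regex_stub
    {
        struct guts_stub *re_guts;  /* NULL if pg_regcomp's first malloc
                                     * failed partway through setup */
    };

    static void
    rfree_sketch(struct regex_stub *re)
    {
        if (re == NULL)
            return;
        if (re->re_guts != NULL)
        {
            /* ... free the NFA, colormap, and other substructure ... */
        }
    }
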
Tom Lane 628cbb50ba Re-implement extraction of fixed prefixes from regular expressions.
To generate btree-indexable conditions from regex WHERE conditions (such as
WHERE indexed_col ~ '^foo'), we need to be able to identify any fixed
prefix that a regex might have; that is, find any string that must be a
prefix of all strings satisfying the regex.  We used to do that with
entirely ad-hoc code that looked at the source text of the regex.  It
didn't know very much about regex syntax, which mostly meant that it would
fail to identify some optimizable cases; but Viktor Rosenfeld reported that
it would produce actively wrong answers for quantified parenthesized
subexpressions, such as '^(foo)?bar'.  Rather than trying to extend the
ad-hoc code to cover this, let's get rid of it altogether in favor of
identifying prefixes by examining the compiled form of a regex.

To do this, I've added a new entry point "pg_regprefix" to the regex library;
hopefully it is defined in a sufficiently general fashion that it can remain
in the library when/if that code gets split out as a standalone project.

Since this bug has been there for a very long time, this fix needs to get
back-patched.  However it depends on some other recent commits (particularly
the addition of wchar-to-database-encoding conversion), so I'll commit this
separately and then go to work on back-porting the necessary fixes.
2012-07-10 14:54:37 -04:00
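
In rough outline, the new entry point takes a compiled regex and reports whether all matching strings share a fixed prefix. The declarations below are stand-ins; see regex.h for the authoritative ones:

    #include <stddef.h>

    typedef unsigned int pg_wchar;  /* stand-in for the backend wide-char type */
    struct regex_stub;              /* stands in for regex_t */

    /* Illustrative result codes: no fixed prefix exists, a proper prefix
     * exists, or the regex can match exactly one string. */
    enum prefix_result {NO_PREFIX, HAS_PREFIX, EXACT_MATCH};

    extern enum prefix_result pg_regprefix_sketch(struct regex_stub *re,
                                                  pg_wchar **prefix,
                                                  size_t *prefixlen);

With a HAS_PREFIX-style result for WHERE col ~ '^foo', the planner can generate a btree range condition such as col >= 'foo' AND col < 'fop'.
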