postgresql

Commit Graph

Author	SHA1	Message	Date
Tom Lane	8a29ed0530	Fix misoptimization of "{1,1}" quantifiers in regular expressions. A bounded quantifier with m = n = 1 might be thought a no-op. But according to our documentation (which traces back to Henry Spencer's original man page) it still imposes greediness, or non-greediness in the case of the non-greedy variant "{1,1}?", on whatever it's attached to. This turns out not to work though, because parseqatom() optimizes away the m = n = 1 case without regard for whether it's supposed to change the greediness of the argument RE. We can fix this by just not applying the optimization when the greediness needs to change; the subsequent general cases handle it fine. The three cases in which we can still apply the optimization are (a) no quantifier, or quantifier does not impose a preference; (b) atom has no greediness property, implying it cannot match a variable amount of text anyway; or (c) quantifier's greediness is same as atom's. Note that in most cases where one of these applies, we'd have exited earlier in the "not a messy case" fast path. I think it's now only possible to get to the optimization when the atom involves capturing parentheses or a non-top-level backref. Back-patch to all supported branches. I'd ordinarily be hesitant to put a subtle behavioral change into back branches, but in this case it's very hard to see a reason why somebody would write "{1,1}?" unless they're trying to get the documented change-of-greediness behavior. Discussion: https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net	2019-05-12 18:53:38 -04:00
Tom Lane	41c912cad1	Clean up warnings from -Wimplicit-fallthrough. Recent gcc can warn about switch-case fall throughs that are not explicitly labeled as intentional. This seems like a good thing, so clean up the warnings exposed thereby by labeling all such cases with comments that gcc will recognize. In files that already had one or more suitable comments, I generally matched the existing style of those. Otherwise I went with /* FALLTHROUGH /, which is one of the spellings approved at the more-restrictive-than-default level -Wimplicit-fallthrough=4. (At the default level you can also spell it / FALL ?THRU */, and it's not picky about case. What you can't do is include additional text in the same comment, so some existing comments containing versions of this aren't good enough.) Testing with gcc 8.0.1 (Fedora 28's current version), I found that I also had to put explicit "break"s after elog(ERROR) or ereport(ERROR); apparently, for this purpose gcc doesn't recognize that those don't return. That seems like possibly a gcc bug, but it's fine because in most places we did that anyway; so this amounts to a visit from the style police. Discussion: https://postgr.es/m/15083.1525207729@sss.pgh.pa.us	2018-05-01 19:35:08 -04:00
Tom Lane	c7b8998ebb	Phase 2 of pgindent updates. Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit `e3860ffa4d` wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us	2017-06-21 15:19:25 -04:00
Tom Lane	e3860ffa4d	Initial pgindent run with pg_bsd_indent version 2.0. The new indent version includes numerous fixes thanks to Piotr Stefaniak. The main changes visible in this commit are: * Nicer formatting of function-pointer declarations. * No longer unexpectedly removes spaces in expressions using casts, sizeof, or offsetof. * No longer wants to add a space in "struct structname varname", as well as some similar cases for const- or volatile-qualified pointers. Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely. * Fixes bug where comments following declarations were sometimes placed with no space separating them from the code. * Fixes some odd decisions for comments following case labels. * Fixes some cases where comments following code were indented to less than the expected column 33. On the less good side, it now tends to put more whitespace around typedef names that are not listed in typedefs.list. This might encourage us to put more effort into typedef name collection; it's not really a bug in indent itself. There are more changes coming after this round, having to do with comment indentation and alignment of lines appearing within parentheses. I wanted to limit the size of the diffs to something that could be reviewed without one's eyes completely glazing over, so it seemed better to split up the changes as much as practical. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us	2017-06-21 14:39:04 -04:00
Tom Lane	d6b059ec74	Document intentional violations of header inclusion policy. Although there are good reasons for our policy of including postgres.h as the first #include in every .c file, never from .h files, there are two places where it seems expedient to violate the policy because the alternative is to modify externally-supplied .c files. (In the case of the regexp library, the idea that it's externally-supplied is kind of at odds with reality, but I haven't entirely given up hope that it will become a standalone project some day.) Add some comments to make it explicit that this is a policy violation and provide the reasoning. In passing, move #include "miscadmin.h" out of regcomp.c and into regcustom.h, which is where it should be if we're taking this reasoning seriously at all. Discussion: https://postgr.es/m/CAEepm=2zCoeq3QxVwhS5DFeUh=yU6z81pbWMgfOB8OzyiBwxzw@mail.gmail.com Discussion: https://postgr.es/m/11634.1488932128@sss.pgh.pa.us	2017-03-08 17:01:13 -05:00
Tom Lane	c54159d44c	Make locale-dependent regex character classes work for large char codes. Previously, we failed to recognize Unicode characters above U+7FF as being members of locale-dependent character classes such as [[:alpha:]]. (Actually, the same problem occurs for large pg_wchar values in any multibyte encoding, but UTF8 is the only case people have actually complained about.) It's impractical to get Spencer's original code to handle character classes or ranges containing many thousands of characters, because it insists on considering each member character individually at regex compile time, whether or not the character will ever be of interest at run time. To fix, choose a cutoff point MAX_SIMPLE_CHR below which we process characters individually as before, and deal with entire ranges or classes as single entities above that. We can actually make things cheaper than before for chars below the cutoff, because the color map can now be a simple linear array for those chars, rather than the multilevel tree structure Spencer designed. It's more expensive than before for chars above the cutoff, because we must do a binary search in a list of high chars and char ranges used in the regex pattern, plus call iswalpha() and friends for each locale-dependent character class used in the pattern. However, multibyte encodings are normally designed to give smaller codes to popular characters, so that we can expect that the slow path will be taken relatively infrequently. In any case, the speed penalty appears minor except when we have to apply iswalpha() etc. to high character codes at runtime --- and the previous coding gave wrong answers for those cases, so whether it was faster is moot. Tom Lane, reviewed by Heikki Linnakangas Discussion: <15563.1471913698@sss.pgh.pa.us>	2016-09-05 17:06:29 -04:00
Tom Lane	a859e64003	Clean up another pre-ANSI-C-ism in regex code: get rid of pcolor typedef. pcolor was used to represent function arguments that are nominally of type color, but when using a pre-ANSI C compiler would be passed as the promoted integer type. We really don't need that anymore.	2016-08-19 13:31:10 -04:00
Tom Lane	6eefd2422e	Remove typedef celt from the regex library, along with macro NOCELT. The regex library used to have a notion of a "collating element" that was distinct from a "character", but Henry Spencer never actually implemented his planned support for multi-character collating elements, and the Tcl crew ripped out most of the stubs for that years ago. The only thing left that distinguished the "celt" typedef from the "chr" typedef was that "celt" was supposed to also be able to hold the not-a-character "NOCELT" value. However, NOCELT was not used anywhere after the MCCE stub removal changes, which means there's no need for celt to be different from chr. Removing the separate typedef simplifies matters and also removes a trap for the unwary, in that celt is signed while chr may not be, so comparisons could mean different things. There's no bug there today because we restrict CHR_MAX to be less than INT_MAX, but I think there may have been such bugs before we did that, and there could be again if anyone ever decides to fool with the range of chr. This patch also removes assorted unnecessary casts to "chr" of values that are already chrs. Many of these seem to be leftover from days when the code was compatible with pre-ANSI C.	2016-08-19 12:51:02 -04:00
Tom Lane	3bb3f42f37	Fix some regex issues with out-of-range characters and large char ranges. Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a bad choice because it is outside the range of type "celt" (int32). Characters approaching that limit could lead to infinite loops in logic such as "for (c = a; c <= b; c++)" where c is of type celt but the range bounds are chr. Such loops will work safely only if CHR_MAX+1 is representable in celt, since c must advance to beyond b before the loop will exit. Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe. It's highly unlikely that Unicode will ever assign codes that high, and none of our other backend encodings need characters beyond that either. In addition to modifying the macro, we have to explicitly enforce character range restrictions on the values of \u, \U, and \x escape sequences, else the limit is trivially bypassed. Also, the code for expanding case-independent character ranges in bracket expressions had a potential integer overflow in its calculation of the number of characters it could generate, which could lead to allocating too small a character vector and then overwriting memory. An attacker with the ability to supply arbitrary regex patterns could easily cause transient DOS via server crashes, and the possibility for privilege escalation has not been ruled out. Quite aside from the integer-overflow problem, the range expansion code was unnecessarily inefficient in that it always produced a result consisting of individual characters, abandoning the knowledge that we had a range to start with. If the input range is large, this requires excessive memory. Change it so that the original range is reported as-is, and then we add on any case-equivalent characters that are outside that range. With this approach, we can bound the number of individual characters allowed without sacrificing much. This patch allows at most 100000 individual characters, which I believe to be more than the number of case pairs existing in Unicode, so that the restriction will never be hit in practice. It's still possible for range() to take awhile given a large character code range, so also add statement-cancel detection to its loop. The downstream function dovec() also lacked cancel detection, and could take a long time given a large output from range(). Per fuzz testing by Greg Stark. Back-patch to all supported branches. Security: CVE-2016-0773	2016-02-08 10:25:40 -05:00
Tom Lane	a43b4ab111	Fix enforcement of restrictions inside regexp lookaround constraints. Lookahead and lookbehind constraints aren't allowed to contain backrefs, and parentheses within them are always considered non-capturing. Or so says the manual. But the regexp parser forgot about these rules once inside a parenthesized subexpression, so that constructs like (\w)(?=(\1)) were accepted (but then not correctly executed --- a case like this acted like (\w)(?=\w), without any enforcement that the two \w's match the same text). And in (?=((foo))) the innermost parentheses would be counted as capturing parentheses, though no text would ever be captured for them. To fix, properly pass down the "type" argument to the recursive invocation of parse(). Back-patch to all supported branches; it was agreed that silent misexecution of such patterns is worse than throwing an error, even though new errors in minor releases are generally not desirable.	2015-11-07 12:43:24 -05:00
Tom Lane	12c9a04008	Implement lookbehind constraints in our regular-expression engine. A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match ending at the current point in the string, rather than one starting at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)	2015-10-30 19:14:19 -04:00
Tom Lane	afdfcd3f76	Miscellaneous cleanup of regular-expression compiler. Revert our previous addition of "all" flags to copyins() and copyouts(); they're no longer needed, and were never anything but an unsightly hack. Improve a couple of infelicities in the REG_DEBUG code for dumping the NFA data structure, including adding code to count the total number of states and arcs. Add a couple of missed error checks. Add some more documentation in the README file, and some regression tests illustrating cases that exceeded the state-count limit and/or took unreasonable amounts of time before this set of patches. Back-patch to all supported branches.	2015-10-16 15:55:59 -04:00
Tom Lane	538b3b8b35	Improve memory-usage accounting in regular-expression compiler. This code previously counted the number of NFA states it created, and complained if a limit was exceeded, so as to prevent bizarre regex patterns from consuming unreasonable time or memory. That's fine as far as it went, but the code paid no attention to how many arcs linked those states. Since regexes can be contrived that have O(N) states but will need O(N^2) arcs after fixempties() processing, it was still possible to blow out memory, and take a long time doing it too. To fix, modify the bookkeeping to count space used by both states and arcs. I did not bother with including the "color map" in the accounting; it can only grow to a few megabytes, which is not a lot in comparison to what we're allowing for states+arcs (about 150MB on 64-bit machines or half that on 32-bit machines). Looking at some of the larger real-world regexes captured in the Tcl regression test suite suggests that the most that is likely to be needed for regexes found in the wild is under 10MB, so I believe that the current limit has enough headroom to make it okay to keep it as a hard-wired limit. In connection with this, redefine REG_ETOOBIG as meaning "regular expression is too complex"; the previous wording of "nfa has too many states" was already somewhat inapropos because of the error code's use for stack depth overrun, and it was not very user-friendly either. Back-patch to all supported branches.	2015-10-16 15:55:59 -04:00
Tom Lane	6a7153661d	Improve performance of pullback/pushfwd in regular-expression compiler. The previous coding would create a new intermediate state every time it wanted to interchange the ordering of two constraint arcs. Certain regex features such as \Y can generate large numbers of parallel constraint arcs, and if we needed to reorder the results of that, we created unreasonable numbers of intermediate states. To improve matters, keep a list of already-created intermediate states associated with the state currently being considered by the outer loop; we can re-use such states to place all the new arcs leading to the same destination or source. I also took the trouble to redefine push() and pull() to have a less risky API: they no longer delete any state or arc that the caller might possibly have a pointer to, except for the specifically-passed constraint arc. This reduces the risk of re-introducing the same type of error seen in the failed patch for CVE-2007-4772. Back-patch to all supported branches.	2015-10-16 15:55:59 -04:00
Tom Lane	f5b7d103bc	Improve performance of fixempties() pass in regular-expression compiler. The previous coding took something like O(N^4) time to fully process a chain of N EMPTY arcs. We can't really do much better than O(N^2) because we have to insert about that many arcs, but we can do lots better than what's there now. The win comes partly from using mergeins() to amortize de-duplication of arcs across multiple source states, and partly from exploiting knowledge of the ordering of arcs for each state to avoid looking at arcs we don't need to consider during the scan. We do have to be a bit careful of the possible reordering of arcs introduced by the sort-merge coding of the previous commit, but that's not hard to deal with. Back-patch to all supported branches.	2015-10-16 15:55:59 -04:00
Tom Lane	579840ca05	Fix O(N^2) performance problems in regular-expression compiler. Change the singly-linked in-arc and out-arc lists to be doubly-linked, so that arc deletion is constant time rather than having worst-case time proportional to the number of other arcs on the connected states. Modify the bulk arc transfer operations copyins(), copyouts(), moveins(), moveouts() so that they use a sort-and-merge algorithm whenever there's more than a small number of arcs to be copied or moved. The previous method is O(N^2) in the number of arcs involved, because it performs duplicate checking independently for each copied arc. The new method may change the ordering of existing arcs for the destination state, but nothing really cares about that. Provide another bulk arc copying method mergeins(), which is unused as of this commit but is needed for the next one. It basically is like copyins(), but the source arcs might not all come from the same state. Replace the O(N^2) bubble-sort algorithm used in carcsort() with a qsort() call. These changes greatly improve the performance of regex compilation for large or complex regexes, at the cost of extra space for arc storage during compilation. The original tradeoff was probably fine when it was made, but now we care more about speed and less about memory consumption. Back-patch to all supported branches.	2015-10-16 15:55:59 -04:00
Tom Lane	48789c5d23	Fix regular-expression compiler to handle loops of constraint arcs. It's possible to construct regular expressions that contain loops of constraint arcs (that is, ^ $ AHEAD BEHIND or LACON arcs). There's no use in fully traversing such a loop at execution, since you'd just end up in the same NFA state without having consumed any input. Worse, such a loop leads to infinite looping in the pullback/pushfwd stage of compilation, because we keep pushing or pulling the same constraints around the loop in a vain attempt to move them to the pre or post state. Such looping was previously recognized in CVE-2007-4772; but the fix only handled the case of trivial single-state loops (that is, a constraint arc leading back to its source state) ... and not only that, it was incorrect even for that case, because it broke the admittedly-not-very-clearly-stated API contract of the pull() and push() subroutines. The first two regression test cases added by this commit exhibit patterns that result in assertion failures because of that (though there seem to be no ill effects in non-assert builds). The other new test cases exhibit multi-state constraint loops; in an unpatched build they will run until the NFA state-count limit is exceeded. To fix, remove the code added for CVE-2007-4772, and instead create a general-purpose constraint-loop-breaking phase of regex compilation that executes before we do pullback/pushfwd. Since we never need to traverse a constraint loop fully, we can just break the loop at any chosen spot, if we add clone states that can replicate any sequence of arc transitions that would've traversed just part of the loop. Also add some commentary clarifying why we have to have all these machinations in the first place. This class of problems has been known for some time --- we had a report from Marc Mamin about two years ago, for example, and there are related complaints in the Tcl bug tracker. I had discussed a fix of this kind off-list with Henry Spencer, but didn't get around to doing something about it until the issue was rediscovered by Greg Stark recently. Back-patch to all supported branches.	2015-10-16 15:55:58 -04:00
Tom Lane	b63fc28776	Add recursion depth protections to regular expression matching. Some of the functions in regex compilation and execution recurse, and therefore could in principle be driven to stack overflow. The Tcl crew has seen this happen in practice in duptraverse(), though their fix was to put in a hard-wired limit on the number of recursive levels, which is not too appetizing --- fortunately, we have enough infrastructure to check the actually available stack. Greg Stark has also seen it in other places while fuzz testing on a machine with limited stack space. Let's put guards in to prevent crashes in all these places. Since the regex code would leak memory if we simply threw elog(ERROR), we have to introduce an API that checks for stack depth without throwing such an error. Fortunately that's not difficult.	2015-10-02 14:51:58 -04:00
Tom Lane	d9c0c728af	Fix low-probability memory leak in regex execution. After an internal failure in shortest() or longest() while pinning down the exact location of a match, find() forgot to free the DFA structure before returning. This is pretty unlikely to occur, since we just successfully ran the "search" variant of the DFA; but it could happen, and it would result in a session-lifespan memory leak since this code uses malloc() directly. Problem seems to have been aboriginal in Spencer's library, so back-patch all the way. In passing, correct a thinko in a comment I added awhile back about the meaning of the "ntree" field. I happened across these issues while comparing our code to Tcl's version of the library.	2015-09-18 13:55:17 -04:00
Tom Lane	b44d92b67b	Sync regex code with Tcl 8.6.4. Sync our regex code with upstream changes since last time we did this, which was Tcl 8.5.11 (see commit `08fd6ff37f`). The only functional change here is to disbelieve that an octal escape is three digits long if it would exceed \377. That's a bug fix, but it's a minor one and could change the interpretation of working regexes, so don't back-patch. In addition to that, s/INFINITY/DUPINF/ to eliminate the risk of collisions with <math.h>'s macro, and s/LOCAL/NOPROP/ because that also seems like an unnecessarily collision-prone macro name. There were some other cosmetic changes in their copy that I did not adopt, notably a rather half-hearted attempt at renaming some of the C functions in a more verbose style. (I'm not necessarily against the concept, but renaming just a few functions in the package is not an improvement.)	2015-09-16 15:25:25 -04:00
Tom Lane	91cf3135b9	Fix minor bug in regexp makesearch() function. The list-wrangling here was done wrong, allowing the same state to get put into the list twice. The following loop then would clone it twice. The second clone would wind up with no inarcs, so that there was no observable misbehavior AFAICT, but a useless state in the finished NFA isn't an especially good thing.	2015-09-09 20:14:58 -04:00
Tom Lane	8a0258c318	Fix some possible low-memory failures in regexp compilation. newnfa() failed to set the regex error state when malloc() fails. Several places in regcomp.c failed to check for an error after calling subre(). Each of these mistakes could lead to null-pointer-dereference crashes in memory-starved backends. Report and patch by Andreas Seltenreich. Back-patch to all branches.	2015-08-12 00:48:11 -04:00
Tom Lane	1567e659a8	Fix two low-probability memory leaks in regular expression parsing. If pg_regcomp failed after having invoked markst/cleanst, it would leak any "struct subre" nodes it had created. (We've already detected all regex syntax errors at that point, so the only likely causes of later failure would be query cancel or out-of-memory.) To fix, make sure freesrnode knows the difference between the pre-cleanst and post-cleanst cleanup procedures. Add some documentation of this less-than-obvious point. Also, newlacon did the wrong thing with an out-of-memory failure from realloc(), so that the previously allocated array would be leaked. Both of these are pretty low-probability scenarios, but a bug is a bug, so patch all the way back. Per bug #10976 from Arthur O'Dwyer.	2014-07-18 13:00:27 -04:00
Bruce Momjian	0a78320057	pgindent run for 9.4 This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.	2014-05-06 12:12:18 -04:00
Tom Lane	9662143f0c	Allow regex operations to be terminated early by query cancel requests. The regex code didn't have any provision for query cancel; which is unsurprising given its non-Postgres origin, but still problematic since some operations can take a long time. Introduce a callback function to check for a pending query cancel or session termination request, and call it in a couple of strategic spots where we can make the regex code exit with an error indicator. If we ever actually split out the regex code as a standalone library, some additional work will be needed to let the cancel callback function be specified externally to the library. But that's straightforward (certainly so by comparison to putting the locale-dependent character classification logic on a similar arms-length basis), and there seems no need to do it right now. A bigger issue is that there may be more places than these two where we need to check for cancels. We can always add more checks later, now that the infrastructure is in place. Since there are known examples of not-terribly-long regexes that can lock up a backend for a long time, back-patch to all supported branches. I have hopes of fixing the known performance problems later, but adding query cancel ability seems like a good idea even if they were all fixed.	2014-03-01 15:20:56 -05:00
Tom Lane	0d79c0a8cc	Make various variables const (read-only). These changes should generally improve correctness/maintainability. A nice side benefit is that several kilobytes move from initialized data to text segment, allowing them to be shared across processes and probably reducing copy-on-write overhead while forking a new backend. Unfortunately this doesn't seem to help libpq in the same way (at least not when it's compiled with -fpic on x86_64), but we can hope the linker at least collects all nominally-const data together even if it's not actually part of the text segment. Also, make pg_encname_tbl[] static in encnames.c, since there seems no very good reason for any other code to use it; per a suggestion from Wim Lewis, who independently submitted a patch that was mostly a subset of this one. Oskari Saarenmaa, with some editorialization by me	2014-01-18 16:04:32 -05:00
Tom Lane	a7b61d4f5a	Fix infinite-loop risk in fixempties() stage of regex compilation. The previous coding of this function could get into situations where it would never terminate, because successive passes would re-add EMPTY arcs that had been removed by the previous pass. Rewrite the function completely using a new algorithm that is guaranteed to terminate, and also seems to be usually faster than the old one. Per Tcl bugs 3604074 and 3606683. Tom Lane and Don Porter	2013-03-07 11:51:03 -05:00
Tom Lane	73dc003bee	Add missing error check in regexp parser. parseqatom() failed to check for an error return (NULL result) from its recursive call to parsebranch(), and in consequence could crash with a null-pointer dereference after an error return. This bug has been there since day one, but wasn't noticed before, probably because most error cases in parsebranch() didn't actually lead to returning NULL. Add the missing error check, and also tweak parsebranch() to exit in a less indirect fashion after a call to parseqatom() fails. Report by Tomasz Karlik, fix by me.	2013-02-27 10:40:03 -05:00
Tom Lane	54fd196ffc	Prevent corner-case core dump in rfree(). rfree() failed to cope with the case that pg_regcomp() had initialized the regex_t struct but then failed to allocate any memory for re->re_guts (ie, the first malloc call in pg_regcomp() failed). It would try to touch the guts struct anyway, and thus dump core. This is a sufficiently narrow corner case that it's not surprising it's never been seen in the field; but still a bug is a bug, so patch all active branches. Noted while investigating whether we need to call pg_regfree after a failure return from pg_regcomp. Other than this bug, it turns out we don't, so adjust comments appropriately.	2012-07-15 13:27:54 -04:00
Tom Lane	c6aae3042b	Simplify and document regex library's compact-NFA representation. The previous coding abused the first element of a cNFA state's arcs list to hold a per-state flag bit, which was confusing, undocumented, and not even particularly efficient. Get rid of that in favor of a separate "stflags" vector. Since there's only one bit in use, I chose to allocate a char per state; we could possibly replace this with a bitmap at some point, but that would make accesses a little slower. It's already about 8X smaller than before, so let's not get overly tense. Also document the representation better than it was before, which is to say not at all. This patch is a byproduct of investigations towards extracting a "fixed prefix" string from the compact-NFA representation of regex patterns. Might need to back-patch it if we decide to back-patch that fix, but for now it's just code cleanup so I'll just put it in HEAD.	2012-07-07 17:39:50 -04:00
Bruce Momjian	927d61eeff	Run pgindent on 9.2 source tree in preparation for first 9.3 commit-fest.	2012-06-10 15:20:04 -04:00
Tom Lane	3cbfe485e4	Remove useless "retry memory" logic within regex engine. Apparently some primordial version of Spencer's engine needed cdissect() and child functions to be able to continue matching from a previous position when re-called. That is dead code, though, since trivial inspection shows that cdissect can never be entered without having previously done zapmem which resets the relevant retry counter. I have also verified experimentally that no case in the Tcl regression tests reaches cdissect with a nonzero retry value. Accordingly, remove that logic. This doesn't really save any noticeable number of cycles in itself, but it is one step towards making dissect() and cdissect() equivalent, which will allow removing hundreds of lines of near-duplicated code. Since struct subre's "retry" field is no longer particularly related to any kind of retry, rename it to "id". As of this commit it's only used for identifying a subre node in debug printouts, so you might think we should get rid of the field entirely; but I have a plan for another use.	2012-02-24 18:40:28 -05:00
Tom Lane	173e29aa5d	Fix the general case of quantified regex back-references. Cases where a back-reference is part of a larger subexpression that is quantified have never worked in Spencer's regex engine, because he used a compile-time transformation that neglected the need to check the back-reference match in iterations before the last one. (That was okay for capturing parens, and we still do it if the regex has only capturing parens ... but it's not okay for backrefs.) To make this work properly, we have to add an "iteration" node type to the regex engine's vocabulary of sub-regex nodes. Since this is a moderately large change with a fair risk of introducing new bugs of its own, apply to HEAD only, even though it's a fix for a longstanding bug.	2012-02-24 01:41:03 -05:00
Tom Lane	5223f96d92	Fix regex back-references that are directly quantified with . The syntax "\n", that is a backref with a * quantifier directly applied to it, has never worked correctly in Spencer's library. This has been an open bug in the Tcl bug tracker since 2005: https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894 The core of the problem is in parseqatom(), which first changes "\n" to "\n+\|" and then applies repeat() to the NFA representing the backref atom. repeat() thinks that any arc leading into its "rp" argument is part of the sub-NFA to be repeated. Unfortunately, since parseqatom() already created the arc that was intended to represent the empty bypass around "\n+", this arc gets moved too, so that it now leads into the state loop created by repeat(). Thus, what was supposed to be an "empty" bypass gets turned into something that represents zero or more repetitions of the NFA representing the backref atom. In the original example, in place of ^([bc])\1$ we now have something that acts like ^([bc])(\1+\|[bc])$ At runtime, the branch involving the actual backref fails, as it's supposed to, but then the other branch succeeds anyway. We could no doubt fix this by some rearrangement of the operations in parseqatom(), but that code is plenty ugly already, and what's more the whole business of converting "x" to "x+\|" probably needs to go away to fix another problem I'll mention in a moment. Instead, this patch suppresses the *-conversion when the target is a simple backref atom, leaving the case of m == 0 to be handled at runtime. This makes the patch in regcomp.c a one-liner, at the cost of having to tweak cbrdissect() a little. In the event I went a bit further than that and rewrote cbrdissect() to check all the string-length-related conditions before it starts comparing characters. It seems a bit stupid to possibly iterate through many copies of an n-character backreference, only to fail at the end because the target string's length isn't a multiple of n --- we could have found that out before starting. The existing coding could only be a win if integer division is hugely expensive compared to character comparison, but I don't know of any modern machine where that might be true. This does not fix all the problems with quantified back-references. In particular, the code is still broken for back-references that appear within a larger expression that is quantified (so that direct insertion of the quantification limits into the BACKREF node doesn't apply). I think fixing that will take some major surgery on the NFA code, specifically introducing an explicit iteration node type instead of trying to transform iteration into concatenation of modified regexps. Back-patch to all supported branches. In HEAD, also add a regression test case for this. (It may seem a bit silly to create a regression test file for just one test case; but I'm expecting that we will soon import a whole bunch of regex regression tests from Tcl, so might as well create the infrastructure now.)	2012-02-20 00:52:33 -05:00
Tom Lane	27af91438b	Create the beginnings of internals documentation for the regex code. Create src/backend/regex/README to hold an implementation overview of the regex package, and fill it in with some preliminary notes about the code's DFA/NFA processing and colormap management. Much more to do there of course. Also, improve some code comments around the colormap and cvec code. No functional changes except to add one missing assert.	2012-02-19 18:58:23 -05:00
Tom Lane	1e16a8107d	Teach regular expression operators to honor collations. This involves getting the character classification and case-folding functions in the regex library to use the collations infrastructure. Most of this work had been done already in connection with the upper/lower and LIKE logic, so it was a simple matter of transposition. While at it, split out these functions into a separate source file regc_pg_locale.c, so that they can be correctly labeled with the Postgres project's license rather than the Scriptics license. These functions are 100% Postgres-written code whereas what remains in regc_locale.c is still mostly not ours, so lumping them both under the same copyright notice was getting more and more misleading.	2011-04-10 18:03:09 -04:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Magnus Hagander	9f2e211386	Remove cvs keywords from all files.	2010-09-20 22:08:53 +02:00
Tom Lane	e621037eec	Tweak a couple of macros in the regex code to suppress compiler warnings from "clang". The VERR changes make an assignment unconditional, which is probably easier to read/understand anyway, and one can hardly argue that it's worth shaving cycles off the case of reporting another error when one has already been detected. The INSIST change limits where that macro can be used, but not in a way that creates a problem for any existing call.	2010-08-02 02:29:39 +00:00
Bruce Momjian	65e806cba1	pgindent run for 9.0	2010-02-26 02:01:40 +00:00
Tom Lane	3e51ae491d	Fix some comments that got mangled by pgindent.	2010-01-30 04:18:00 +00:00
Tom Lane	df1e965e12	Sync our regex code with upstream changes since last time we did this, which was Tcl 8.4.8. The main changes are to remove the never-fully-implemented code for multi-character collating elements, and to const-ify some stuff a bit more fully. In combination with the recent security patch, this commit brings us into line with Tcl 8.5.0. Note that I didn't make any effort to duplicate a lot of cosmetic changes that they made to bring their copy into line with their own style guidelines, such as adding braces around single-line IF bodies. Most of those we either had done already (such as ANSI-fication of function headers) or there is no point because pgindent would undo the change anyway.	2008-02-14 17:33:37 +00:00
Tom Lane	06ce02f989	Adjust some regex debugging printouts to not give wrong-format-width warnings on a 64-bit machine. Noted while chasing a recent regex bug report.	2007-10-06 16:05:54 +00:00
Bruce Momjian	1dc3498251	Standard pgindent run for 8.1.	2005-10-15 02:49:52 +00:00
Bruce Momjian	b492c3accc	Add parentheses to macros when args are used in computations. Without them, the executation behavior could be unexpected.	2005-05-25 21:40:43 +00:00
Tom Lane	d4c4d28427	Install Tcl regex fixes to sync our regex engine with Tcl 8.4.8 (up from 8.4.1). This corrects some curious regex bugs, though not the greediness issue I was hoping to find a solution for :-(	2004-11-24 22:56:54 +00:00
Tom Lane	0bd61548ab	Solve the 'Turkish problem' with undesirable locale behavior for case conversion of basic ASCII letters. Remove all uses of strcasecmp and strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp; remove most but not all direct uses of toupper and tolower in favor of pg_toupper and pg_tolower. These functions use the same notions of case folding already developed for identifier case conversion. I left the straight locale-based folding in place for situations where we are just manipulating user data and not trying to match it to built-in strings --- for example, the SQL upper() function is still locale dependent. Perhaps this will prove not to be what's wanted, but at the moment we can initdb and pass regression tests in Turkish locale.	2004-05-07 00:24:59 +00:00
PostgreSQL Daemon	969685ad44	$Header: -> $PostgreSQL Changes ...	2003-11-29 19:52:15 +00:00
Tom Lane	5594aa6a6e	Fix broken definition of :print: character class, per Bruno Wolff. Also, make :alnum: character class directly dependent on isalnum() rather than guessing.	2003-09-29 00:21:58 +00:00
Bruce Momjian	46785776c4	Another pgindent run with updated typedefs.	2003-08-08 21:42:59 +00:00

1 2

87 Commits