Commit Graph

54 Commits

Author SHA1 Message Date
Tom Lane 5223f96d92 Fix regex back-references that are directly quantified with *.
The syntax "\n*", that is a backref with a * quantifier directly applied
to it, has never worked correctly in Spencer's library.  This has been an
open bug in the Tcl bug tracker since 2005:
https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894

The core of the problem is in parseqatom(), which first changes "\n*" to
"\n+|" and then applies repeat() to the NFA representing the backref atom.
repeat() thinks that any arc leading into its "rp" argument is part of the
sub-NFA to be repeated.  Unfortunately, since parseqatom() already created
the arc that was intended to represent the empty bypass around "\n+", this
arc gets moved too, so that it now leads into the state loop created by
repeat().  Thus, what was supposed to be an "empty" bypass gets turned into
something that represents zero or more repetitions of the NFA representing
the backref atom.  In the original example, in place of
	^([bc])\1*$
we now have something that acts like
	^([bc])(\1+|[bc]*)$
At runtime, the branch involving the actual backref fails, as it's supposed
to, but then the other branch succeeds anyway.

We could no doubt fix this by some rearrangement of the operations in
parseqatom(), but that code is plenty ugly already, and what's more the
whole business of converting "x*" to "x+|" probably needs to go away to fix
another problem I'll mention in a moment.  Instead, this patch suppresses
the *-conversion when the target is a simple backref atom, leaving the case
of m == 0 to be handled at runtime.  This makes the patch in regcomp.c a
one-liner, at the cost of having to tweak cbrdissect() a little.  In the
event I went a bit further than that and rewrote cbrdissect() to check all
the string-length-related conditions before it starts comparing characters.
It seems a bit stupid to possibly iterate through many copies of an
n-character backreference, only to fail at the end because the target
string's length isn't a multiple of n --- we could have found that out
before starting.  The existing coding could only be a win if integer
division is hugely expensive compared to character comparison, but I don't
know of any modern machine where that might be true.

This does not fix all the problems with quantified back-references.  In
particular, the code is still broken for back-references that appear within
a larger expression that is quantified (so that direct insertion of the
quantification limits into the BACKREF node doesn't apply).  I think fixing
that will take some major surgery on the NFA code, specifically introducing
an explicit iteration node type instead of trying to transform iteration
into concatenation of modified regexps.

Back-patch to all supported branches.  In HEAD, also add a regression test
case for this.  (It may seem a bit silly to create a regression test file
for just one test case; but I'm expecting that we will soon import a whole
bunch of regex regression tests from Tcl, so might as well create the
infrastructure now.)
2012-02-20 00:52:33 -05:00
Tom Lane 27af91438b Create the beginnings of internals documentation for the regex code.
Create src/backend/regex/README to hold an implementation overview of
the regex package, and fill it in with some preliminary notes about
the code's DFA/NFA processing and colormap management.  Much more to
do there of course.

Also, improve some code comments around the colormap and cvec code.
No functional changes except to add one missing assert.
2012-02-19 18:58:23 -05:00
Tom Lane 1e16a8107d Teach regular expression operators to honor collations.
This involves getting the character classification and case-folding
functions in the regex library to use the collations infrastructure.
Most of this work had been done already in connection with the upper/lower
and LIKE logic, so it was a simple matter of transposition.

While at it, split out these functions into a separate source file
regc_pg_locale.c, so that they can be correctly labeled with the Postgres
project's license rather than the Scriptics license.  These functions are
100% Postgres-written code whereas what remains in regc_locale.c is still
mostly not ours, so lumping them both under the same copyright notice was
getting more and more misleading.
2011-04-10 18:03:09 -04:00
Bruce Momjian bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Tom Lane e621037eec Tweak a couple of macros in the regex code to suppress compiler warnings
from "clang".  The VERR changes make an assignment unconditional, which is
probably easier to read/understand anyway, and one can hardly argue that
it's worth shaving cycles off the case of reporting another error when
one has already been detected.  The INSIST change limits where that macro
can be used, but not in a way that creates a problem for any existing call.
2010-08-02 02:29:39 +00:00
Bruce Momjian 65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Tom Lane 3e51ae491d Fix some comments that got mangled by pgindent. 2010-01-30 04:18:00 +00:00
Tom Lane df1e965e12 Sync our regex code with upstream changes since last time we did this, which
was Tcl 8.4.8.  The main changes are to remove the never-fully-implemented
code for multi-character collating elements, and to const-ify some stuff a
bit more fully.  In combination with the recent security patch, this commit
brings us into line with Tcl 8.5.0.

Note that I didn't make any effort to duplicate a lot of cosmetic changes
that they made to bring their copy into line with their own style
guidelines, such as adding braces around single-line IF bodies.  Most of
those we either had done already (such as ANSI-fication of function headers)
or there is no point because pgindent would undo the change anyway.
2008-02-14 17:33:37 +00:00
Tom Lane 06ce02f989 Adjust some regex debugging printouts to not give wrong-format-width
warnings on a 64-bit machine.  Noted while chasing a recent regex
bug report.
2007-10-06 16:05:54 +00:00
Bruce Momjian 1dc3498251 Standard pgindent run for 8.1. 2005-10-15 02:49:52 +00:00
Bruce Momjian b492c3accc Add parentheses to macros when args are used in computations. Without
them, the executation behavior could be unexpected.
2005-05-25 21:40:43 +00:00
Tom Lane d4c4d28427 Install Tcl regex fixes to sync our regex engine with Tcl 8.4.8 (up from
8.4.1).  This corrects some curious regex bugs, though not the greediness
issue I was hoping to find a solution for :-(
2004-11-24 22:56:54 +00:00
Tom Lane 0bd61548ab Solve the 'Turkish problem' with undesirable locale behavior for case
conversion of basic ASCII letters.  Remove all uses of strcasecmp and
strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp;
remove most but not all direct uses of toupper and tolower in favor of
pg_toupper and pg_tolower.  These functions use the same notions of
case folding already developed for identifier case conversion.  I left
the straight locale-based folding in place for situations where we are
just manipulating user data and not trying to match it to built-in
strings --- for example, the SQL upper() function is still locale
dependent.  Perhaps this will prove not to be what's wanted, but at
the moment we can initdb and pass regression tests in Turkish locale.
2004-05-07 00:24:59 +00:00
PostgreSQL Daemon 969685ad44 $Header: -> $PostgreSQL Changes ... 2003-11-29 19:52:15 +00:00
Tom Lane 5594aa6a6e Fix broken definition of :print: character class, per Bruno Wolff.
Also, make :alnum: character class directly dependent on isalnum()
rather than guessing.
2003-09-29 00:21:58 +00:00
Bruce Momjian 46785776c4 Another pgindent run with updated typedefs. 2003-08-08 21:42:59 +00:00
Bruce Momjian 089003fb46 pgindent run. 2003-08-04 00:43:34 +00:00
Tom Lane 7bcc6d98fb Replace regular expression package with Henry Spencer's latest version
(extracted from Tcl 8.4.1 release, as Henry still hasn't got round to
making it a separate library).  This solves a performance problem for
multibyte, as well as upgrading our regexp support to match recent Tcl
and nearly match recent Perl.
2003-02-05 17:41:33 +00:00
Bruce Momjian bea4792125 This patch removes a bunch of superfluous #include directives: if
postgres.h or c.h includes a system header (such as stdio.h or
stdlib.h), there's no need to specifically include it in any of the .c
files in the backend.

Neil Conway
2002-11-08 20:23:57 +00:00
Bruce Momjian e50f52a074 pgindent run. 2002-09-04 20:31:48 +00:00
Bruce Momjian 97ac103289 Remove sys/types.h in files that include postgres.h, and hence c.h,
because c.h has sys/types.h.
2002-09-02 02:47:07 +00:00
Tatsuo Ishii ed7baeaf4d Remove #ifdef MULTIBYTE per hackers list discussion. 2002-08-29 07:22:30 +00:00
Thomas G. Lockhart ea01a451cc Implement SQL99 OVERLAY(). Allows substitution of a substring in a string.
Implement SQL99 SIMILAR TO as a synonym for our existing operator "~".
Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape).
 Extend the definition to make the FOR clause optional.
 Define textregexsubstr() to actually implement this feature.
Update the regression test to include these new string features.
 All tests pass.
Rename the regular expression support routines from "pg95_xxx" to "pg_xxx".
Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.
2002-06-11 15:44:38 +00:00
Tom Lane 3e48c66136 Fix code to work when isalpha and friends are macros, not functions. 2002-05-05 00:50:31 +00:00
Bruce Momjian f71c924f10 [ Patch comments in three pieces.]
Attached is a pacth against 7.2 which adds locale awareness to the
character classes of the regular expression engine.

...

> > I still think the xdigit class could be handled the same way the digit
> > class is (by enumeration rather than using the isxdigit function). That
> > saves you a cicle, and I don't think there's any loss.
>
> In fact, I will email you when I apply the original patch.

I miss that case :-(. Here is the pached patch.

...

Here is a patch which addresses Tatsuo's concerns (it does return an
static struct instead of constructing it).
2002-04-24 01:51:11 +00:00
Bruce Momjian 6783b2372e Another pgindent run. Fixes enum indenting, and improves #endif
spacing.  Also adds space for one-line comments.
2001-10-28 06:26:15 +00:00
Bruce Momjian b81844b173 pgindent run on all C files. Java run to follow. initdb/regression
tests pass.
2001-10-25 05:50:21 +00:00
Bruce Momjian 9e1552607a pgindent run. Make it all clean. 2001-03-22 04:01:46 +00:00
Tom Lane f7a839bc2b Clean up portability problems in regexp package: change all routine
definitions from K&R to ANSI C style, and fix broken assumption that
int and long are the same datatype.  This repairs problems observed
on Alpha with regexps having between 32 and 63 states.
2001-02-13 00:02:36 +00:00
Tom Lane a27b691e29 Ensure that all uses of <ctype.h> functions are applied to unsigned-char
values, whether the local char type is signed or not.  This is necessary
for portability.  Per discussion on pghackers around 9/16/00.
2000-12-03 20:45:40 +00:00
Peter Eisentraut 533d516629 Removed MBFLAGS from makefiles since it's now done in include/config.h. 2000-01-19 02:59:03 +00:00
Bruce Momjian a9591ce66a Change #include's to use <> and "" as appropriate. 1999-07-15 23:04:24 +00:00
Bruce Momjian 07842084fe pgindent run over code. 1999-05-25 16:15:34 +00:00
Tatsuo Ishii 235a569aaa Fix multi-byte+locale problem 1999-03-25 04:46:53 +00:00
Bruce Momjian fa1a8d6a97 OK, folks, here is the pgindent output. 1998-09-01 04:40:42 +00:00
Bruce Momjian af74855a60 Renaming cleanup, no pgindent yet. 1998-09-01 03:29:17 +00:00
Bruce Momjian 7b2b779a2a Add auto-size to screen to \d? commands. Use UNION to show all
\d? results in one query. Add \d? field search feature.  Rename MB
to MULTIBYTE.
1998-07-18 18:34:34 +00:00
Bruce Momjian cb7cbc16fa Hi, here are the patches to enhance existing MB handling. This time
I have implemented a framework of encoding translation between the
backend and the frontend. Also I have added a new variable setting
command:

SET CLIENT_ENCODING TO 'encoding';

Other features include:
	Latin1 support more 8 bit cleaness

See doc/README.mb for more details. Note that the pacthes are
against May 30 snapshot.

Tatsuo Ishii
1998-06-16 07:29:54 +00:00
Bruce Momjian 6bd323c6b3 Remove un-needed braces around single statements. 1998-06-15 19:30:31 +00:00
Bruce Momjian 645146260e Cleanup of compiler warnings. 1998-04-06 02:38:26 +00:00
Marc G. Fournier 661ecf3c48 From: t-ishii@sra.co.jp
Included are patches intended for allowing PostgreSQL to handle
multi-byte charachter sets such as EUC(Extende Unix Code), Unicode and
Mule internal code. With the MB patch you can use multi-byte character
sets in regexp and LIKE. The encoding system chosen is determined at
the compile time.

To enable the MB extension, you need to define a variable "MB" in
Makefile.global or in Makefile.custom. For further information please
take a look at README.mb under doc directory.

(Note that unlike "jp patch" I do not use modified GNU regexp any
more. I changed Henry Spencer's regexp coming with PostgreSQL.)
1998-03-15 07:39:04 +00:00
Bruce Momjian a32450a585 pgindent run before 6.3 release, with Thomas' requested changes. 1998-02-26 04:46:47 +00:00
Bruce Momjian 24cab6bd0d Goodbye register keyword. Compiler knows better. 1998-02-11 19:14:04 +00:00
Bruce Momjian 59f6a57e59 Used modified version of indent that understands over 100 typedefs. 1997-09-08 21:56:23 +00:00
Bruce Momjian 319dbfa736 Another PGINDENT run that changes variable indenting and case label indenting. Also static variable indenting. 1997-09-08 02:41:22 +00:00
Bruce Momjian 1ccd423235 Massive commit to run PGINDENT on all *.c and *.h files. 1997-09-07 05:04:48 +00:00
Bruce Momjian ea5b5357cd Remove more (void) and fix -Wall warnings. 1997-08-12 22:55:25 +00:00
Bruce Momjian edb58721b8 Fix pgproc names over 15 chars in output. Add strNcpy() function. remove some (void) casts that are unnecessary. 1997-08-12 20:16:25 +00:00
Bryan Henderson fa9c0fff36 Remove __P macro usage so it compiles without cdefs.h. 1996-12-15 09:21:37 +00:00