The previous coding abused the first element of a cNFA state's arcs list
to hold a per-state flag bit, which was confusing, undocumented, and not
even particularly efficient. Get rid of that in favor of a separate
"stflags" vector. Since there's only one bit in use, I chose to allocate a
char per state; we could possibly replace this with a bitmap at some point,
but that would make accesses a little slower. It's already about 8X
smaller than before, so let's not get overly tense.
Also document the representation better than it was before, which is to say
not at all.
This patch is a byproduct of investigations towards extracting a "fixed
prefix" string from the compact-NFA representation of regex patterns.
Might need to back-patch it if we decide to back-patch that fix, but for
now it's just code cleanup so I'll just put it in HEAD.
zaptreesubs() was coded to unconditionally reset a capture subre's
corresponding pmatch[] entry. However, in regexes without backrefs, that
array is caller-supplied and might not have as many entries as the regex
has capturing parens. So check the array length and do nothing if there
is no corresponding entry, much as subset() does. Failure to check this
resulted in a stack clobber in the case reported by Marko Kreen.
This bug appears to have been latent in the regex library from the
beginning. It was not exposed because find() called dissect() not
cdissect(), and the dissect() code path didn't ever call zaptreesubs()
(formerly zapmem()). When I unified dissect() and cdissect() in commit
4dd78bf37a, the problem was exposed.
Now that I've seen this, I'm rather suspicious that we might need to
back-patch it; but will refrain for now, for lack of evidence that
the case can be hit in the previous coding.
The "uncomplicated" case isn't materially less complicated than the full
case, certainly not enough so to justify duplicating nearly 500 lines
of code. The only extra work being done in the full path is zaptreesubs,
which is very cheap compared to everything else being done here, and
besides that I'm less than convinced that it's not needed in some cases
even without backrefs.
In nested sub-regex trees, lower-level nodes created DFAs and then
destroyed them again before exiting, which is a bit dumb considering that
the recursive search is likely to call those nodes again later. Instead
cache each created DFA until the end of pg_regexec(). This is basically a
space for time tradeoff, in that it might increase the maximum memory
usage. However, in most regex patterns there are not all that many subre
nodes, so not that many DFAs --- and in any case, the peak usage occurs
when reaching the bottom recursion level, and except for alternation cases
that's going to be the same anyway.
Apparently some primordial version of Spencer's engine needed cdissect()
and child functions to be able to continue matching from a previous
position when re-called. That is dead code, though, since trivial
inspection shows that cdissect can never be entered without having
previously done zapmem which resets the relevant retry counter. I have
also verified experimentally that no case in the Tcl regression tests
reaches cdissect with a nonzero retry value. Accordingly, remove that
logic. This doesn't really save any noticeable number of cycles in itself,
but it is one step towards making dissect() and cdissect() equivalent,
which will allow removing hundreds of lines of near-duplicated code.
Since struct subre's "retry" field is no longer particularly related to
any kind of retry, rename it to "id". As of this commit it's only used
for identifying a subre node in debug printouts, so you might think we
should get rid of the field entirely; but I have a plan for another use.
Cases where a back-reference is part of a larger subexpression that
is quantified have never worked in Spencer's regex engine, because
he used a compile-time transformation that neglected the need to
check the back-reference match in iterations before the last one.
(That was okay for capturing parens, and we still do it if the
regex has *only* capturing parens ... but it's not okay for backrefs.)
To make this work properly, we have to add an "iteration" node type
to the regex engine's vocabulary of sub-regex nodes. Since this is a
moderately large change with a fair risk of introducing new bugs of its
own, apply to HEAD only, even though it's a fix for a longstanding bug.
The syntax "\n*", that is a backref with a * quantifier directly applied
to it, has never worked correctly in Spencer's library. This has been an
open bug in the Tcl bug tracker since 2005:
https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894
The core of the problem is in parseqatom(), which first changes "\n*" to
"\n+|" and then applies repeat() to the NFA representing the backref atom.
repeat() thinks that any arc leading into its "rp" argument is part of the
sub-NFA to be repeated. Unfortunately, since parseqatom() already created
the arc that was intended to represent the empty bypass around "\n+", this
arc gets moved too, so that it now leads into the state loop created by
repeat(). Thus, what was supposed to be an "empty" bypass gets turned into
something that represents zero or more repetitions of the NFA representing
the backref atom. In the original example, in place of
^([bc])\1*$
we now have something that acts like
^([bc])(\1+|[bc]*)$
At runtime, the branch involving the actual backref fails, as it's supposed
to, but then the other branch succeeds anyway.
We could no doubt fix this by some rearrangement of the operations in
parseqatom(), but that code is plenty ugly already, and what's more the
whole business of converting "x*" to "x+|" probably needs to go away to fix
another problem I'll mention in a moment. Instead, this patch suppresses
the *-conversion when the target is a simple backref atom, leaving the case
of m == 0 to be handled at runtime. This makes the patch in regcomp.c a
one-liner, at the cost of having to tweak cbrdissect() a little. In the
event I went a bit further than that and rewrote cbrdissect() to check all
the string-length-related conditions before it starts comparing characters.
It seems a bit stupid to possibly iterate through many copies of an
n-character backreference, only to fail at the end because the target
string's length isn't a multiple of n --- we could have found that out
before starting. The existing coding could only be a win if integer
division is hugely expensive compared to character comparison, but I don't
know of any modern machine where that might be true.
This does not fix all the problems with quantified back-references. In
particular, the code is still broken for back-references that appear within
a larger expression that is quantified (so that direct insertion of the
quantification limits into the BACKREF node doesn't apply). I think fixing
that will take some major surgery on the NFA code, specifically introducing
an explicit iteration node type instead of trying to transform iteration
into concatenation of modified regexps.
Back-patch to all supported branches. In HEAD, also add a regression test
case for this. (It may seem a bit silly to create a regression test file
for just one test case; but I'm expecting that we will soon import a whole
bunch of regex regression tests from Tcl, so might as well create the
infrastructure now.)
While this doesn't save a huge amount of runtime, it still seems worth
doing, especially since I realized that the data copying I did in my first
draft was quite unnecessary. In this version, once we have the results
cached, getting them back for re-use is really very cheap.
Also, remove the hard-wired limitation to not consider wctype.h results for
character codes above 255. It turns out that we can't push the limit as
far up as I'd originally hoped, because the regex colormap code is not
efficient enough to cope very well with character classes containing many
thousand letters, which a Unicode locale is entirely capable of producing.
Still, we can push it up to U+7FF (which I chose as the limit of 2-byte
UTF8 characters), which will at least make Eastern Europeans happy pending
a better solution. Thus, this commit resolves the specific complaint in
bug #6457, but not the more general issue that letters of non-western
alphabets are mostly not recognized as matching [[:alpha:]].
Create src/backend/regex/README to hold an implementation overview of
the regex package, and fill it in with some preliminary notes about
the code's DFA/NFA processing and colormap management. Much more to
do there of course.
Also, improve some code comments around the colormap and cvec code.
No functional changes except to add one missing assert.
Sync our regex code with upstream changes since last time we did this,
which was Tcl 8.5.0 (see commit df1e965e12).
There are no functional changes here; the main point is just to lay down
a commit-log marker that somebody has looked at this recently, and to do
what we can to keep the two codebases comparable.
This involves getting the character classification and case-folding
functions in the regex library to use the collations infrastructure.
Most of this work had been done already in connection with the upper/lower
and LIKE logic, so it was a simple matter of transposition.
While at it, split out these functions into a separate source file
regc_pg_locale.c, so that they can be correctly labeled with the Postgres
project's license rather than the Scriptics license. These functions are
100% Postgres-written code whereas what remains in regc_locale.c is still
mostly not ours, so lumping them both under the same copyright notice was
getting more and more misleading.
Per C standard, these are semantically the same thing; but saying NULL
when you mean NULL is good for readability.
Marti Raudsepp, per results of INRIA's Coccinelle.
from "clang". The VERR changes make an assignment unconditional, which is
probably easier to read/understand anyway, and one can hardly argue that
it's worth shaving cycles off the case of reporting another error when
one has already been detected. The INSIST change limits where that macro
can be used, but not in a way that creates a problem for any existing call.
matching before recursing instead of after. The DFA match eliminates
unworkable midpoint choices a lot faster than the recursive check, in most
cases, so doing it first can speed things up; particularly in pathological
cases such as recently exhibited by Michael Glaesemann.
In addition, apply some cosmetic changes that were applied upstream (in the
Tcl project) at the same time, in order to sync with upstream version 1.15
of regexec.c.
Upstream apparently intends to backpatch this, so I will too. The
pathological behavior could be unpleasant if encountered in the field,
which seems to justify any risk of introducing new bugs.
Tom Lane, reviewed by Donal K. Fellows of Tcl project
locale-dependent character classification properly when the database encoding
is UTF8.
The previous coding worked okay in single-byte encodings, or in any case for
ASCII characters, but failed entirely on multibyte characters. The fix
assumes that the <wctype.h> functions use Unicode code points as the wchar
representation for Unicode, ie, wchar matches pg_wchar.
This is only a partial solution, since we're still stupid about non-ASCII
characters in multibyte encodings other than UTF8. The practical effect
of that is limited, however, since those cases are generally Far Eastern
glyphs for which concepts like case-folding don't apply anyway. Certainly
all or nearly all of the field reports of problems have been about UTF8.
A more general solution would require switching to the platform's wchar
representation for all regex operations; which is possible but would have
substantial disadvantages. Let's try this and see if it's sufficient in
practice.
was Tcl 8.4.8. The main changes are to remove the never-fully-implemented
code for multi-character collating elements, and to const-ify some stuff a
bit more fully. In combination with the recent security patch, this commit
brings us into line with Tcl 8.5.0.
Note that I didn't make any effort to duplicate a lot of cosmetic changes
that they made to bring their copy into line with their own style
guidelines, such as adding braces around single-line IF bodies. Most of
those we either had done already (such as ANSI-fication of function headers)
or there is no point because pgindent would undo the change anyway.
are shared with Tcl, since it's their code to begin with, and the patches
have been copied from Tcl 8.5.0. Problems:
CVE-2007-4769: Inadequate check on the range of backref numbers allows
crash due to out-of-bounds read.
CVE-2007-4772: Infinite loop in regex optimizer for pattern '($|^)*'.
CVE-2007-6067: Very slow optimizer cleanup for regex with a large NFA
representation, as well as crash if we encounter an out-of-memory condition
during NFA construction.
Part of the response to CVE-2007-6067 is to put a limit on the number of
states in the NFA representation of a regex. This seems needed even though
the within-the-code problems have been corrected, since otherwise the code
could try to use very large amounts of memory for a suitably-crafted regex,
leading to potential DOS by driving the system into swap, activating a kernel
OOM killer, etc.
Although there are certainly plenty of ways to drive the system into effective
DOS with poorly-written SQL queries, these problems seem worth treating as
security issues because many applications might accept regex search patterns
from untrustworthy sources.
Thanks to Will Drewry of Google for reporting these problems. Patches by Will
Drewry and Tom Lane.
Security: CVE-2007-4769, CVE-2007-4772, CVE-2007-6067
versions of gcc (I'm seeing it with Apple's gcc 4.0.1). I think the
reason we did not see this before was that the assert() macros in the
regex code were all no-ops till recently.
as we do (and upstream Tcl doesn't). The loop limit might be subject
to negotiation if anyone ever tries to do regex debugging in Far
Eastern languages, but for now 1000 seems plenty. CHR_MAX was right out :-(
Standard English uses "may", "can", and "might" in different ways:
may - permission, "You may borrow my rake."
can - ability, "I can lift that log."
might - possibility, "It might rain today."
Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice. Similarly, "It may crash" is better stated, "It might crash".
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.
The specification of this function is as follows.
regexp_replace(source text, pattern text, replacement text, [flags
text])
returns text
Replace string that matches to regular expression in source text to
replacement text.
- pattern is regular expression pattern.
- replacement is replace string that can use '\1'-'\9', and '\&'.
'\1'-'\9': back reference to the n'th subexpression.
'\&' : entire matched string.
- flags can use the following values:
g: global (replace all)
i: ignore case
When the flags is not specified, case sensitive, replace the first
instance only.
Atsushi Ogawa
conversion of basic ASCII letters. Remove all uses of strcasecmp and
strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp;
remove most but not all direct uses of toupper and tolower in favor of
pg_toupper and pg_tolower. These functions use the same notions of
case folding already developed for identifier case conversion. I left
the straight locale-based folding in place for situations where we are
just manipulating user data and not trying to match it to built-in
strings --- for example, the SQL upper() function is still locale
dependent. Perhaps this will prove not to be what's wanted, but at
the moment we can initdb and pass regression tests in Turkish locale.
(extracted from Tcl 8.4.1 release, as Henry still hasn't got round to
making it a separate library). This solves a performance problem for
multibyte, as well as upgrading our regexp support to match recent Tcl
and nearly match recent Perl.
postgres.h or c.h includes a system header (such as stdio.h or
stdlib.h), there's no need to specifically include it in any of the .c
files in the backend.
Neil Conway
Implement SQL99 SIMILAR TO as a synonym for our existing operator "~".
Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape).
Extend the definition to make the FOR clause optional.
Define textregexsubstr() to actually implement this feature.
Update the regression test to include these new string features.
All tests pass.
Rename the regular expression support routines from "pg95_xxx" to "pg_xxx".
Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.
Attached is a pacth against 7.2 which adds locale awareness to the
character classes of the regular expression engine.
...
> > I still think the xdigit class could be handled the same way the digit
> > class is (by enumeration rather than using the isxdigit function). That
> > saves you a cicle, and I don't think there's any loss.
>
> In fact, I will email you when I apply the original patch.
I miss that case :-(. Here is the pached patch.
...
Here is a patch which addresses Tatsuo's concerns (it does return an
static struct instead of constructing it).
definitions from K&R to ANSI C style, and fix broken assumption that
int and long are the same datatype. This repairs problems observed
on Alpha with regexps having between 32 and 63 states.
$(CC) $(CFLAGS) $(LDFLAGS) <object files> <extra-libraries> $(LIBS) -o $@
This form seemed to be the most portable, readable, and logical, but in any
case it's better than having a dozen different ones in the tree.
source directory. This involves mostly makefiles using $(srcdir) when they
might have used ".". (Regression tests don't work with this, yet.)
Sort out usage of CPPFLAGS, CFLAGS (and CXXFLAGS). Add "override" keyword
in most places, to preserve necessary flags even when the user overrode the
flags.
As Bruce mentioned, this is due to the conflict among changes we made.
Included patches should fix the problem(I changed all MB to
MULTIBYTE). Please let me know if you have further problem.
P.S. I did not include pathces to configure and gram.c to save the
file size(configure.in and gram.y modified).
From: t-ishii@sra.co.jp
Attached are patches to enhance the multi-byte support. (patches are
against 7/18 snapshot)
* determine encoding at initdb/createdb rather than compile time
Now initdb/createdb has an option to specify the encoding. Also, I
modified the syntax of CREATE DATABASE to accept encoding option. See
README.mb for more details.
For this purpose I have added new column "encoding" to pg_database.
Also pg_attribute and pg_class are changed to catch up the
modification to pg_database. Actually I haved added pg_database_mb.h,
pg_attribute_mb.h and pg_class_mb.h. These are used only when MB is
enabled. The reason having separate files is I couldn't find a way to
use ifdef or whatever in those files. I have to admit it looks
ugly. No way.
* support for PGCLIENTENCODING when issuing COPY command
commands/copy.c modified.
* support for SQL92 syntax "SET NAMES"
See gram.y.
* support for LATIN2-5
* add UNICODE regression test case
* new test suite for MB
New directory test/mb added.
* clean up source files
Basic idea is to have MB's own subdirectory for easier maintenance.
These are include/mb and backend/utils/mb.
I have implemented a framework of encoding translation between the
backend and the frontend. Also I have added a new variable setting
command:
SET CLIENT_ENCODING TO 'encoding';
Other features include:
Latin1 support more 8 bit cleaness
See doc/README.mb for more details. Note that the pacthes are
against May 30 snapshot.
Tatsuo Ishii
Hi, here are patches I promised (against 6.3.2):
* character_length(), position(), substring() are now aware of
multi-byte characters
* add octet_length()
* add --with-mb option to configure
* new regression tests for EUC_KR
(contributed by "Soonmyung. Hong" <hong@lunaris.hanmesoft.co.kr>)
* add some test cases to the EUC_JP regression test
* fix problem in regress/regress.sh in case of System V
* fix toupper(), tolower() to handle 8bit chars
note that:
o patches for both configure.in and configure are
included. maybe the one for configure is not necessary.
o pg_proc.h was modified to add octet_length(). I used OIDs
(1374-1379) for that. Please let me know if these numbers are not
appropriate.
Attached you'll find a (big) patch that fixes make dep and make
depend in all Makefiles where I found it to be appropriate.
It also removes the dependency in Makefile.global for NAMEDATALEN
and OIDNAMELEN by making backend/catalog/genbki.sh and bin/initdb/initdb.sh
a little smarter.
This no longer requires initdb.sh that is turned into initdb with
a sed script when installing Postgres, hence initdb.sh should be
renamed to initdb (after the patch has been applied :-) )
This patch is against the 6.3 sources, as it took a while to
complete.
Please review and apply,
Cheers,
Jeroen van Vianen
Included are patches intended for allowing PostgreSQL to handle
multi-byte charachter sets such as EUC(Extende Unix Code), Unicode and
Mule internal code. With the MB patch you can use multi-byte character
sets in regexp and LIKE. The encoding system chosen is determined at
the compile time.
To enable the MB extension, you need to define a variable "MB" in
Makefile.global or in Makefile.custom. For further information please
take a look at README.mb under doc directory.
(Note that unlike "jp patch" I do not use modified GNU regexp any
more. I changed Henry Spencer's regexp coming with PostgreSQL.)
Makefile.global.
End result, if all goes well, should allow for much easier porting, since
there will no longer be a concept of a "port". Most, if not everything,
*should* be determined by configure, or by the compiler itself. Still
work to be done though :)