76 Commits

Author SHA1 Message Date
Tom Lane
538b3b8b35 Improve memory-usage accounting in regular-expression compiler.
This code previously counted the number of NFA states it created, and
complained if a limit was exceeded, so as to prevent bizarre regex patterns
from consuming unreasonable time or memory.  That's fine as far as it went,
but the code paid no attention to how many arcs linked those states.  Since
regexes can be contrived that have O(N) states but will need O(N^2) arcs
after fixempties() processing, it was still possible to blow out memory,
and take a long time doing it too.  To fix, modify the bookkeeping to count
space used by both states and arcs.

I did not bother with including the "color map" in the accounting; it
can only grow to a few megabytes, which is not a lot in comparison to
what we're allowing for states+arcs (about 150MB on 64-bit machines
or half that on 32-bit machines).

Looking at some of the larger real-world regexes captured in the Tcl
regression test suite suggests that the most that is likely to be needed
for regexes found in the wild is under 10MB, so I believe that the current
limit has enough headroom to make it okay to keep it as a hard-wired limit.

In connection with this, redefine REG_ETOOBIG as meaning "regular
expression is too complex"; the previous wording of "nfa has too many
states" was already somewhat inapropos because of the error code's use
for stack depth overrun, and it was not very user-friendly either.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00
Tom Lane
579840ca05 Fix O(N^2) performance problems in regular-expression compiler.
Change the singly-linked in-arc and out-arc lists to be doubly-linked,
so that arc deletion is constant time rather than having worst-case time
proportional to the number of other arcs on the connected states.

Modify the bulk arc transfer operations copyins(), copyouts(), moveins(),
moveouts() so that they use a sort-and-merge algorithm whenever there's
more than a small number of arcs to be copied or moved.  The previous
method is O(N^2) in the number of arcs involved, because it performs
duplicate checking independently for each copied arc.  The new method may
change the ordering of existing arcs for the destination state, but nothing
really cares about that.

Provide another bulk arc copying method mergeins(), which is unused as
of this commit but is needed for the next one.  It basically is like
copyins(), but the source arcs might not all come from the same state.

Replace the O(N^2) bubble-sort algorithm used in carcsort() with a qsort()
call.

These changes greatly improve the performance of regex compilation for
large or complex regexes, at the cost of extra space for arc storage during
compilation.  The original tradeoff was probably fine when it was made, but
now we care more about speed and less about memory consumption.

Back-patch to all supported branches.
2015-10-16 15:55:59 -04:00
Tom Lane
b63fc28776 Add recursion depth protections to regular expression matching.
Some of the functions in regex compilation and execution recurse, and
therefore could in principle be driven to stack overflow.  The Tcl crew
has seen this happen in practice in duptraverse(), though their fix was
to put in a hard-wired limit on the number of recursive levels, which is
not too appetizing --- fortunately, we have enough infrastructure to check
the actually available stack.  Greg Stark has also seen it in other places
while fuzz testing on a machine with limited stack space.  Let's put guards
in to prevent crashes in all these places.

Since the regex code would leak memory if we simply threw elog(ERROR),
we have to introduce an API that checks for stack depth without throwing
such an error.  Fortunately that's not difficult.
2015-10-02 14:51:58 -04:00
Tom Lane
d9c0c728af Fix low-probability memory leak in regex execution.
After an internal failure in shortest() or longest() while pinning down the
exact location of a match, find() forgot to free the DFA structure before
returning.  This is pretty unlikely to occur, since we just successfully
ran the "search" variant of the DFA; but it could happen, and it would
result in a session-lifespan memory leak since this code uses malloc()
directly.  Problem seems to have been aboriginal in Spencer's library,
so back-patch all the way.

In passing, correct a thinko in a comment I added awhile back about the
meaning of the "ntree" field.

I happened across these issues while comparing our code to Tcl's version
of the library.
2015-09-18 13:55:17 -04:00
Tom Lane
b44d92b67b Sync regex code with Tcl 8.6.4.
Sync our regex code with upstream changes since last time we did this,
which was Tcl 8.5.11 (see commit 08fd6ff37f71485e2fc04bc6ce07d2a483c36702).

The only functional change here is to disbelieve that an octal escape is
three digits long if it would exceed \377.  That's a bug fix, but it's
a minor one and could change the interpretation of working regexes, so
don't back-patch.

In addition to that, s/INFINITY/DUPINF/ to eliminate the risk of collisions
with <math.h>'s macro, and s/LOCAL/NOPROP/ because that also seems like
an unnecessarily collision-prone macro name.

There were some other cosmetic changes in their copy that I did not adopt,
notably a rather half-hearted attempt at renaming some of the C functions
in a more verbose style.  (I'm not necessarily against the concept, but
renaming just a few functions in the package is not an improvement.)
2015-09-16 15:25:25 -04:00
Tom Lane
23116d5437 Add a bit more commentary about regex's colormap tree data structure.
Per an off-list question from Piotr Stefaniak.
2015-05-24 12:40:38 -04:00
Bruce Momjian
4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Bruce Momjian
0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not effect backpatching.
2014-05-06 12:12:18 -04:00
Tom Lane
9662143f0c Allow regex operations to be terminated early by query cancel requests.
The regex code didn't have any provision for query cancel; which is
unsurprising given its non-Postgres origin, but still problematic since
some operations can take a long time.  Introduce a callback function to
check for a pending query cancel or session termination request, and
call it in a couple of strategic spots where we can make the regex code
exit with an error indicator.

If we ever actually split out the regex code as a standalone library,
some additional work will be needed to let the cancel callback function
be specified externally to the library.  But that's straightforward
(certainly so by comparison to putting the locale-dependent character
classification logic on a similar arms-length basis), and there seems
no need to do it right now.

A bigger issue is that there may be more places than these two where
we need to check for cancels.  We can always add more checks later,
now that the infrastructure is in place.

Since there are known examples of not-terribly-long regexes that can
lock up a backend for a long time, back-patch to all supported branches.
I have hopes of fixing the known performance problems later, but adding
query cancel ability seems like a good idea even if they were all fixed.
2014-03-01 15:20:56 -05:00
Bruce Momjian
7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Tom Lane
6756c8ad30 Fix old typo in comment.
NFAs have children, but their individual states don't.
2013-10-29 15:34:18 -04:00
Tom Lane
3ccae48f44 Support indexing of regular-expression searches in contrib/pg_trgm.
This works by extracting trigrams from the given regular expression,
in generally the same spirit as the previously-existing support for
LIKE searches, though of course the details are far more complicated.

Currently, only GIN indexes are supported.  We might be able to make
it work with GiST indexes later.

The implementation includes adding API functions to backend/regex/
to provide a view of the search NFA created from a regular expression.
These functions are meant to be generic enough to be supportable in
a standalone version of the regex library, should that ever happen.

Alexander Korotkov, reviewed by Heikki Linnakangas and Tom Lane
2013-04-09 01:06:54 -04:00
Heikki Linnakangas
bf2b0a1478 Fix crash on compiling a regular expression with more than 32k colors.
Throw an error instead.

Backpatch to all supported branches.
2013-04-04 19:48:11 +03:00
Tom Lane
628cbb50ba Re-implement extraction of fixed prefixes from regular expressions.
To generate btree-indexable conditions from regex WHERE conditions (such as
WHERE indexed_col ~ '^foo'), we need to be able to identify any fixed
prefix that a regex might have; that is, find any string that must be a
prefix of all strings satisfying the regex.  We used to do that with
entirely ad-hoc code that looked at the source text of the regex.  It
didn't know very much about regex syntax, which mostly meant that it would
fail to identify some optimizable cases; but Viktor Rosenfeld reported that
it would produce actively wrong answers for quantified parenthesized
subexpressions, such as '^(foo)?bar'.  Rather than trying to extend the
ad-hoc code to cover this, let's get rid of it altogether in favor of
identifying prefixes by examining the compiled form of a regex.

To do this, I've added a new entry point "pg_regprefix" to the regex library;
hopefully it is defined in a sufficiently general fashion that it can remain
in the library when/if that code gets split out as a standalone project.

Since this bug has been there for a very long time, this fix needs to get
back-patched.  However it depends on some other recent commits (particularly
the addition of wchar-to-database-encoding conversion), so I'll commit this
separately and then go to work on back-porting the necessary fixes.
2012-07-10 14:54:37 -04:00
Tom Lane
c6aae3042b Simplify and document regex library's compact-NFA representation.
The previous coding abused the first element of a cNFA state's arcs list
to hold a per-state flag bit, which was confusing, undocumented, and not
even particularly efficient.  Get rid of that in favor of a separate
"stflags" vector.  Since there's only one bit in use, I chose to allocate a
char per state; we could possibly replace this with a bitmap at some point,
but that would make accesses a little slower.  It's already about 8X
smaller than before, so let's not get overly tense.

Also document the representation better than it was before, which is to say
not at all.

This patch is a byproduct of investigations towards extracting a "fixed
prefix" string from the compact-NFA representation of regex patterns.
Might need to back-patch it if we decide to back-patch that fix, but for
now it's just code cleanup so I'll just put it in HEAD.
2012-07-07 17:39:50 -04:00
Bruce Momjian
927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Tom Lane
587359479a Avoid repeated creation/freeing of per-subre DFAs during regex search.
In nested sub-regex trees, lower-level nodes created DFAs and then
destroyed them again before exiting, which is a bit dumb considering that
the recursive search is likely to call those nodes again later.  Instead
cache each created DFA until the end of pg_regexec().  This is basically a
space for time tradeoff, in that it might increase the maximum memory
usage.  However, in most regex patterns there are not all that many subre
nodes, so not that many DFAs --- and in any case, the peak usage occurs
when reaching the bottom recursion level, and except for alternation cases
that's going to be the same anyway.
2012-02-24 18:40:30 -05:00
Tom Lane
3cbfe485e4 Remove useless "retry memory" logic within regex engine.
Apparently some primordial version of Spencer's engine needed cdissect()
and child functions to be able to continue matching from a previous
position when re-called.  That is dead code, though, since trivial
inspection shows that cdissect can never be entered without having
previously done zapmem which resets the relevant retry counter.  I have
also verified experimentally that no case in the Tcl regression tests
reaches cdissect with a nonzero retry value.  Accordingly, remove that
logic.  This doesn't really save any noticeable number of cycles in itself,
but it is one step towards making dissect() and cdissect() equivalent,
which will allow removing hundreds of lines of near-duplicated code.

Since struct subre's "retry" field is no longer particularly related to
any kind of retry, rename it to "id".  As of this commit it's only used
for identifying a subre node in debug printouts, so you might think we
should get rid of the field entirely; but I have a plan for another use.
2012-02-24 18:40:28 -05:00
Tom Lane
173e29aa5d Fix the general case of quantified regex back-references.
Cases where a back-reference is part of a larger subexpression that
is quantified have never worked in Spencer's regex engine, because
he used a compile-time transformation that neglected the need to
check the back-reference match in iterations before the last one.
(That was okay for capturing parens, and we still do it if the
regex has *only* capturing parens ... but it's not okay for backrefs.)

To make this work properly, we have to add an "iteration" node type
to the regex engine's vocabulary of sub-regex nodes.  Since this is a
moderately large change with a fair risk of introducing new bugs of its
own, apply to HEAD only, even though it's a fix for a longstanding bug.
2012-02-24 01:41:03 -05:00
Tom Lane
27af91438b Create the beginnings of internals documentation for the regex code.
Create src/backend/regex/README to hold an implementation overview of
the regex package, and fill it in with some preliminary notes about
the code's DFA/NFA processing and colormap management.  Much more to
do there of course.

Also, improve some code comments around the colormap and cvec code.
No functional changes except to add one missing assert.
2012-02-19 18:58:23 -05:00
Tom Lane
1e16a8107d Teach regular expression operators to honor collations.
This involves getting the character classification and case-folding
functions in the regex library to use the collations infrastructure.
Most of this work had been done already in connection with the upper/lower
and LIKE logic, so it was a simple matter of transposition.

While at it, split out these functions into a separate source file
regc_pg_locale.c, so that they can be correctly labeled with the Postgres
project's license rather than the Scriptics license.  These functions are
100% Postgres-written code whereas what remains in regc_locale.c is still
mostly not ours, so lumping them both under the same copyright notice was
getting more and more misleading.
2011-04-10 18:03:09 -04:00
Magnus Hagander
9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Tom Lane
0d32342501 Teach the regular expression functions to do case-insensitive matching and
locale-dependent character classification properly when the database encoding
is UTF8.

The previous coding worked okay in single-byte encodings, or in any case for
ASCII characters, but failed entirely on multibyte characters.  The fix
assumes that the <wctype.h> functions use Unicode code points as the wchar
representation for Unicode, ie, wchar matches pg_wchar.

This is only a partial solution, since we're still stupid about non-ASCII
characters in multibyte encodings other than UTF8.  The practical effect
of that is limited, however, since those cases are generally Far Eastern
glyphs for which concepts like case-folding don't apply anyway.  Certainly
all or nearly all of the field reports of problems have been about UTF8.
A more general solution would require switching to the platform's wchar
representation for all regex operations; which is possible but would have
substantial disadvantages.  Let's try this and see if it's sufficient in
practice.
2009-12-01 21:00:24 +00:00
Tom Lane
ab61df9e52 Remove regex_flavor GUC, so that regular expressions are always "advanced"
style by default.  Per discussion, there seems to be hardly anything that
really relies on being able to change the regex flavor, so the ability to
select it via embedded options ought to be enough for any stragglers.
Also, if we didn't remove the GUC, we'd really be morally obligated to
mark the regex functions non-immutable, which'd possibly create performance
issues.
2009-10-21 20:38:58 +00:00
Bruce Momjian
d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Magnus Hagander
ad6bf716ba Convert three more guc settings to enum type:
default_transaction_isolation, session_replication_role and regex_flavor.
2008-04-02 14:42:56 +00:00
Tom Lane
df1e965e12 Sync our regex code with upstream changes since last time we did this, which
was Tcl 8.4.8.  The main changes are to remove the never-fully-implemented
code for multi-character collating elements, and to const-ify some stuff a
bit more fully.  In combination with the recent security patch, this commit
brings us into line with Tcl 8.5.0.

Note that I didn't make any effort to duplicate a lot of cosmetic changes
that they made to bring their copy into line with their own style
guidelines, such as adding braces around single-line IF bodies.  Most of
those we either had done already (such as ANSI-fication of function headers)
or there is no point because pgindent would undo the change anyway.
2008-02-14 17:33:37 +00:00
Tom Lane
98f27aaef3 Fix assorted security-grade bugs in the regex engine. All of these problems
are shared with Tcl, since it's their code to begin with, and the patches
have been copied from Tcl 8.5.0.  Problems:

CVE-2007-4769: Inadequate check on the range of backref numbers allows
crash due to out-of-bounds read.
CVE-2007-4772: Infinite loop in regex optimizer for pattern '($|^)*'.
CVE-2007-6067: Very slow optimizer cleanup for regex with a large NFA
representation, as well as crash if we encounter an out-of-memory condition
during NFA construction.

Part of the response to CVE-2007-6067 is to put a limit on the number of
states in the NFA representation of a regex.  This seems needed even though
the within-the-code problems have been corrected, since otherwise the code
could try to use very large amounts of memory for a suitably-crafted regex,
leading to potential DOS by driving the system into swap, activating a kernel
OOM killer, etc.

Although there are certainly plenty of ways to drive the system into effective
DOS with poorly-written SQL queries, these problems seem worth treating as
security issues because many applications might accept regex search patterns
from untrustworthy sources.

Thanks to Will Drewry of Google for reporting these problems.  Patches by Will
Drewry and Tom Lane.

Security: CVE-2007-4769, CVE-2007-4772, CVE-2007-6067
2008-01-03 20:47:55 +00:00
Tom Lane
c4db0d9ae1 Adjust regcustom.h so that all those assert() calls in the regex package
are converted to Postgres Assert() macros, instead of using <assert.h>
as formerly.  No difference in production builds, but --enable-cassert
debug builds will get better coverage for regex testing.
2007-10-06 16:01:51 +00:00
Bruce Momjian
8b4ff8b6a1 Wording cleanup for error messages. Also change can't -> cannot.
Standard English uses "may", "can", and "might" in different ways:

        may - permission, "You may borrow my rake."

        can - ability, "I can lift that log."

        might - possibility, "It might rain today."

Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice.  Similarly, "It may crash" is better stated, "It might crash".
2007-02-01 19:10:30 +00:00
Bruce Momjian
1dc3498251 Standard pgindent run for 8.1. 2005-10-15 02:49:52 +00:00
Bruce Momjian
75a64eeb4b I made the patch that implements regexp_replace again.
The specification of this function is as follows.

regexp_replace(source text, pattern text, replacement text, [flags
text])
returns text

Replace string that matches to regular expression in source text to
replacement text.

 - pattern is regular expression pattern.
 - replacement is replace string that can use '\1'-'\9', and '\&'.
    '\1'-'\9': back reference to the n'th subexpression.
    '\&'     : entire matched string.
 - flags can use the following values:
    g: global (replace all)
    i: ignore case
    When the flags is not specified, case sensitive, replace the first
    instance only.

Atsushi Ogawa
2005-07-10 04:54:33 +00:00
Bruce Momjian
b492c3accc Add parentheses to macros when args are used in computations. Without
them, the executation behavior could be unexpected.
2005-05-25 21:40:43 +00:00
Tom Lane
0bd61548ab Solve the 'Turkish problem' with undesirable locale behavior for case
conversion of basic ASCII letters.  Remove all uses of strcasecmp and
strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp;
remove most but not all direct uses of toupper and tolower in favor of
pg_toupper and pg_tolower.  These functions use the same notions of
case folding already developed for identifier case conversion.  I left
the straight locale-based folding in place for situations where we are
just manipulating user data and not trying to match it to built-in
strings --- for example, the SQL upper() function is still locale
dependent.  Perhaps this will prove not to be what's wanted, but at
the moment we can initdb and pass regression tests in Turkish locale.
2004-05-07 00:24:59 +00:00
PostgreSQL Daemon
55b113257c make sure the $Id tags are converted to $PostgreSQL as well ... 2003-11-29 22:41:33 +00:00
Bruce Momjian
46785776c4 Another pgindent run with updated typedefs. 2003-08-08 21:42:59 +00:00
Bruce Momjian
089003fb46 pgindent run. 2003-08-04 00:43:34 +00:00
Tom Lane
7bcc6d98fb Replace regular expression package with Henry Spencer's latest version
(extracted from Tcl 8.4.1 release, as Henry still hasn't got round to
making it a separate library).  This solves a performance problem for
multibyte, as well as upgrading our regexp support to match recent Tcl
and nearly match recent Perl.
2003-02-05 17:41:33 +00:00
Bruce Momjian
e50f52a074 pgindent run. 2002-09-04 20:31:48 +00:00
Tatsuo Ishii
ed7baeaf4d Remove #ifdef MULTIBYTE per hackers list discussion. 2002-08-29 07:22:30 +00:00
Thomas G. Lockhart
ea01a451cc Implement SQL99 OVERLAY(). Allows substitution of a substring in a string.
Implement SQL99 SIMILAR TO as a synonym for our existing operator "~".
Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape).
 Extend the definition to make the FOR clause optional.
 Define textregexsubstr() to actually implement this feature.
Update the regression test to include these new string features.
 All tests pass.
Rename the regular expression support routines from "pg95_xxx" to "pg_xxx".
Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.
2002-06-11 15:44:38 +00:00
Bruce Momjian
ea08e6cd55 New pgindent run with fixes suggested by Tom. Patch manually reviewed,
initdb/regression tests pass.
2001-11-05 17:46:40 +00:00
Bruce Momjian
6783b2372e Another pgindent run. Fixes enum indenting, and improves #endif
spacing.  Also adds space for one-line comments.
2001-10-28 06:26:15 +00:00
Bruce Momjian
b81844b173 pgindent run on all C files. Java run to follow. initdb/regression
tests pass.
2001-10-25 05:50:21 +00:00
Bruce Momjian
9e1552607a pgindent run. Make it all clean. 2001-03-22 04:01:46 +00:00
Tatsuo Ishii
a0fa71837a Add _REGEX_UTILS_H to avoid duplication. 2001-02-22 04:35:00 +00:00
Tom Lane
f7a839bc2b Clean up portability problems in regexp package: change all routine
definitions from K&R to ANSI C style, and fix broken assumption that
int and long are the same datatype.  This repairs problems observed
on Alpha with regexps having between 32 and 63 states.
2001-02-13 00:02:36 +00:00
Tom Lane
cf16733983 Hmm, this isn't used either. 2001-02-12 22:17:29 +00:00
Tom Lane
d40f4aeac8 Remove unused and largely-broken-anyway compatibility defs. 2001-02-12 22:13:06 +00:00
Tom Lane
d08741eab5 Restructure the key include files per recent pghackers discussion: there
are now separate files "postgres.h" and "postgres_fe.h", which are meant
to be the primary include files for backend .c files and frontend .c files
respectively.  By default, only include files meant for frontend use are
installed into the installation include directory.  There is a new make
target 'make install-all-headers' that adds the whole content of the
src/include tree to the installed fileset, for use by people who want to
develop server-side code without keeping the complete source tree on hand.
Cleaned up a whole lot of crufty and inconsistent header inclusions.
2001-02-10 02:31:31 +00:00