PCRE2 — Perl-compatible regular expressions (revised API)
When you call pcre2_match
(),
it makes use of an internal function called match
(). This calls itself recursively at
branch points in the pattern, in order to remember the state
of the match so that it can back up and try a different
alternative after a failure. As matching proceeds deeper and
deeper into the tree of possibilities, the recursion depth
increases. The match
() function
is also called in other circumstances, for example, whenever
a parenthesized sub-pattern is entered, and in certain cases
of repetition.
Not all calls of match
()
increase the recursion depth; for an item such as a* it may
be called several times at the same level, after matching
different numbers of a's. Furthermore, in a number of cases
where the result of the recursive call would immediately be
passed back as the result of the current call (a "tail
recursion"), the function is just restarted instead.
Each time the internal match
() function is called recursively, it
uses memory from the process stack. For certain kinds of
pattern and data, very large amounts of stack may be needed,
despite the recognition of "tail recursion". Note that if
PCRE2 is compiled with the -fsanitize=address option of the
GCC compiler, the stack requirements are greatly
increased.
The above comments apply when pcre2_match
() is run in its normal
interpretive manner. If the compiled pattern was processed by
pcre2_jit_compile
(), and
just-in-time compiling was successful, and the options passed
to pcre2_match
() were not
incompatible, the matching process uses the JIT-compiled code
instead of the match
()
function. In this case, the memory requirements are handled
entirely differently. See the pcre2jit(3) documentation
for details.
The pcre2_dfa_match
()
function operates in a different way to pcre2_match
(), and uses recursion only when
there is a regular expression recursion or subroutine call in
the pattern. This includes the processing of assertion and
"once-only" subpatterns, which are handled like subroutine
calls. Normally, these are never very deep, and the limit on
the complexity of pcre2_dfa_match
() is controlled by the
amount of workspace it is given. However, it is possible to
write patterns with runaway infinite recursions; such
patterns will cause pcre2_dfa_match
() to run out of stack. At
present, there is no protection against this.
The comments that follow do NOT apply to pcre2_dfa_match
(); they are relevant only
for pcre2_match
() without the
JIT optimization.
pcre2_match
()'s
stack usageYou can often reduce the amount of recursion, and therefore the amount of stack used, by modifying the pattern that is being matched. Consider, for example, this pattern:
([^<]|<(?!inet))+
It matches from wherever it starts until it encounters "<inet" or the end of the data, and is the kind of pattern that might be used when processing an XML file. Each iteration of the outer parentheses matches either one character that is not "<" or a "<" that is not followed by "inet". However, each time a parenthesis is processed, a recursion occurs, so this formulation uses a stack frame for each matched character. For a long string, a lot of stack is required. Consider now this rewritten pattern, which matches exactly the same strings:
([^<]++|<(?!inet))+
This uses very much less stack, because runs of characters that do not contain "<" are "swallowed" in one item inside the parentheses. Recursion happens only when a "<" character that is not followed by "inet" is encountered (and we assume this is relatively rare). A possessive quantifier is used to stop any backtracking into the runs of non-"<" characters, but that is not related to stack usage.
This example shows that one way of avoiding stack problems when matching long subject strings is to write repeated parenthesized subpatterns to match more than one character whenever possible.
pcre2_match
()In environments where stack memory is constrained, you
might want to compile PCRE2 to use heap memory instead of
stack for remembering back-up points when pcre2_match
() is running. This makes it
run more slowly, however. Details of how to do this are
given in the pcre2build(3)
documentation. When built in this way, instead of using the
stack, PCRE2 gets memory for remembering backup points from
the heap. By default, the memory is obtained by calling the
system malloc
() function, but
you can arrange to supply your own memory management
function. For details, see the section entitled "The match
context" in the pcre2api(3)
documentation. Since the block sizes are always the same,
it may be possible to implement customized a memory handler
that is more efficient than the standard function. The
memory blocks obtained for this purpose are retained and
re-used if possible while pcre2_match
() is running. They are all
freed just before it exits.
pcre2_match
()'s
stack usageYou can set limits on the number of times the internal
match
() function is called,
both in total and recursively. If a limit is exceeded,
pcre2_match
() returns an
error code. Setting suitable limits should prevent it from
running out of stack. The default values of the limits are
very large, and unlikely ever to operate. They can be
changed when PCRE2 is built, and they can also be set when
pcre2_match
() is called. For
details of these interfaces, see the pcre2build(3)
documentation and the section entitled "The match context"
in the pcre2api(3)
documentation.
As a very rough rule of thumb, you should reckon on about 500 bytes per recursion. Thus, if you want to limit your stack usage to 8Mb, you should set the limit at 16000 recursions. A 64Mb stack, on the other hand, can support around 128000 recursions.
The pcre2test
test program
has a modifier called "find_limits" which, if applied to a
subject line, causes it to find the smallest limits that
allow a a pattern to match. This is done by calling
pcre2_match
() repeatedly with
different limits.
In Unix-like environments, there is not often a problem with the stack unless very long strings are involved, though the default limit on stack size varies from system to system. Values from 8Mb to 64Mb are common. You can find your default limit by running the command:
ulimit -s
Unfortunately, the effect of running out of stack is often SIGSEGV, though sometimes a more explicit error message is given. You can normally increase the limit on stack size by code such as this:
struct rlimit rlim; getrlimit(RLIMIT_STACK, &rlim); rlim.rlim_cur = 100*1024*1024; setrlimit(RLIMIT_STACK, &rlim);
This reads the current limits (soft and hard) using
getrlimit
(), then attempts to
increase the soft limit to 100Mb using setrlimit
(). You must do this before
calling pcre2_match
().
Last updated: 21 November 2014 Copyright (c) 1997-2014 University of Cambridge.
COPYRIGHT |
---|
This manual page is taken from the PCRE library, which is distributed under the BSD license. |