Perl has long been considered the benchmark for powerful regular expressions. PHP uses a C library called pcre to provide almost complete support for Perl’s arsenal of regular expression features. Perl regular expressions include the POSIX classes and anchors described earlier. A POSIX-style character class in a Perl regular expression works and understands non-English characters using the Unix locale system. Perl regular expressions act on arbitrary binary data, so you can safely match with patterns or strings that contain the NUL-byte (\x00).
Perl-style regular expressions emulate the Perl syntax for patterns, which means that each pattern must be enclosed in a pair of delimiters. Traditionally, the slash (/) character is used; for example, /pattern/. However, any nonalphanumeric, non-whitespace character other than the backslash character (\) can be used to delimit a Perl-style pattern. This is useful when matching strings containing slashes, such as filenames. For example, the following are equivalent:
preg_match('/\/usr\/local\//', '/usr/local/bin/perl'); // returns true
preg_match('#/usr/local/#', '/usr/local/bin/perl'); // returns true
Parentheses (( )), curly braces ({}), square brackets ([]), and angle brackets (<>) can be used as pattern delimiters:
preg_match('{/usr/local/}', '/usr/local/bin/perl'); // returns true
Section 4.10.8 discusses the single-character modifiers you can put after the closing delimiter to modify the behavior of the regular expression engine. A very useful one is x, which makes the regular expression engine strip whitespace and #-marked comments from the regular expression before matching. These two patterns are the same, but one is much easier to read:
'/([[:alpha:]]+)\s+\1/'
'/(             # start capture
   [[:alpha:]]+ # a word
  )             # end capture
  \s+           # whitespace
  \1            # the same word again
 /x'
While Perl’s regular expression syntax includes the POSIX constructs we talked about earlier, some pattern components have a different meaning in Perl. In particular, Perl’s regular expressions are optimized for matching against single lines of text (although there are options that change this behavior).
The period (.) matches any character except for a newline (\n).
The dollar sign ($) matches at the end of the string or, if the string ends with a newline, just before that newline:
preg_match('/is (.*)$/', "the key is in my pants", $captured); // $captured[1] is 'in my pants'
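The newline rules above can be checked directly. The following is a minimal sketch: the s and D modifiers it uses are standard PCRE modifiers in PHP, while the sample strings and variable names are invented for illustration.

```php
<?php
// The period stops at a newline unless the s modifier is given.
$dotDefault = preg_match('/a.b/', "a\nb");    // no match
$dotWithS   = preg_match('/a.b/s', "a\nb");   // match

// By default, $ also matches just before a trailing newline.
$dollarDefault = preg_match('/pants$/', "in my pants\n");   // match

// With the D modifier, $ matches only at the very end of the string.
$dollarWithD = preg_match('/pants$/D', "in my pants\n");    // no match

var_dump($dotDefault, $dotWithS, $dollarDefault, $dollarWithD);
// prints int(0), int(1), int(1), int(0)
```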
Perl-style regular expressions support the POSIX character classes but also define some of their own, as shown in Table 4-9.
Perl-style regular expressions also support additional anchors, as listed in Table 4-10.
Table 4-10. Perl-style anchors
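As a rough illustration, the sketch below exercises a few widely used Perl-style classes (\d, \w, \s) and anchors (\b, \A, \z). It is not a complete list, and the sample strings and variable names are made up.

```php
<?php
// \d matches a digit, \s whitespace, and \w a word character.
preg_match('/(\d+)\s+(\w+)/', 'order 66 clones', $m);
// $m[1] is '66' and $m[2] is 'clones'

// \b matches at a word boundary, so "cat" inside "concatenate" is skipped.
$inside = preg_match('/\bcat\b/', 'concatenate');   // 0
$word   = preg_match('/\bcat\b/', 'the cat sat');   // 1

// \A and \z anchor to the very start and very end of the string,
// so a trailing newline prevents the match.
$exact = preg_match('/\A\d{4}\z/', "2024");         // 1
$trail = preg_match('/\A\d{4}\z/', "2024\n");       // 0
```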
The POSIX quantifiers, which Perl also supports, are always greedy. That is, when faced with a quantifier, the engine matches as much as it can while still satisfying the rest of the pattern. For instance:
preg_match('/(<.*>)/', 'do <b>not</b> press the button', $match); // $match[1] is '<b>not</b>'
The regular expression matches from the first less-than sign to the last greater-than sign. In effect, the .* matches everything after the first less-than sign, and the engine backtracks to make it match less and less until finally there’s a greater-than sign to be matched.
This greediness can be a problem. Sometimes you need minimal (non-greedy) matching, that is, quantifiers that match as few times as possible to satisfy the rest of the pattern. Perl provides a parallel set of quantifiers that match minimally. They’re easy to remember, because they’re the same as the greedy quantifiers, but with a question mark (?) appended. Table 4-11 shows the corresponding greedy and non-greedy quantifiers supported by Perl-style regular expressions.
Here’s how to match a tag using a non-greedy quantifier:
preg_match('/(<.*?>)/', 'do <b>not</b> press the button', $match); // $match[1] is '<b>'
Another, faster way is to use a character class to match every non-greater-than character up to the next greater-than sign:
preg_match('/(<[^>]*>)/', 'do <b>not</b> press the button', $match); // $match[1] is '<b>'
If you enclose a part of a pattern in parentheses, the text that matches that subpattern is captured and can be accessed later. Sometimes, though, you want to create a subpattern without capturing the matching text. In Perl-compatible regular expressions, you can do this using the (?:subpattern) construct:
preg_match('/(?:ello)(.*)/', 'jello biafra', $match); // $match[1] is ' biafra'
You can refer to text captured earlier in a pattern with a backreference: \1 refers to the contents of the first subpattern, \2 refers to the second, and so on. If you nest subpatterns, the first begins with the first opening parenthesis, the second begins with the second opening parenthesis, and so on.
For instance, this identifies doubled words:
preg_match('/([[:alpha:]]+)\s+\1/', 'Paris in the the spring', $m); // returns true and $m[1] is 'the'
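The numbering rule for nested subpatterns can be seen in a short sketch. The sample strings and variable names here are invented for illustration.

```php
<?php
// Groups are numbered by the position of their opening parenthesis,
// so the outer group is 1 and the groups nested inside it are 2 and 3.
preg_match('/((\d+)-(\d+)) (\w+)/', '10-20 units', $m);
// $m[1] is '10-20', $m[2] is '10', $m[3] is '20', $m[4] is 'units'

// A backreference can point at an inner group: \2 here is the single
// digit captured by the nested parentheses, so this finds a doubled digit.
preg_match('/((\d)\2)/', 'a1223', $n);
// $n[1] is '22' and $n[2] is '2'
```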
You can’t capture more than 99 subpatterns.
Perl-style regular expressions let you put single-letter options (flags) after the regular expression pattern to modify the interpretation, or behavior, of the match. For instance, to match case-insensitively, simply use the i flag:
preg_match('/cat/i', 'Stop, Catherine!'); // returns true
Table 4-12 shows the modifiers from Perl that are supported in Perl-compatible regular expressions.
Table 4-12. Perl flags
PHP’s Perl-compatible regular expression functions also support other modifiers that aren’t supported by Perl, as listed in Table 4-13.
Table 4-13. Additional PHP flags
It’s possible to use more than one option in a single pattern, as demonstrated in the following example:
$message = <<< END
To: you@youcorp
From: me@mecorp
Subject: pay up

Pay me or else!
END;
preg_match('/^subject: (.*)/im', $message, $match);
// $match[1] is 'pay up'
In addition to specifying patternwide options after the closing pattern delimiter, you can specify options within a pattern to have them apply only to part of the pattern. The syntax for this is:
(?flags:subpattern)
For example, only the word “PHP” is case-insensitive in this example:
preg_match('/I like (?i:PHP)/', 'I like pHp'); // returns true
The i, m, s, U, x, and X options can be applied internally in this fashion. You can use multiple options at once:
preg_match('/eat (?ix:fo o d)/', 'eat FoOD'); // returns true
Prefix an option with a hyphen (-) to turn it off:
preg_match('/(?-i:I like) PHP/i', 'I like pHp'); // returns true
An alternative form enables or disables the flags until the end of the enclosing subpattern or pattern:
preg_match('/I like (?i)PHP/', 'I like pHp'); // returns true
preg_match('/I (like (?i)PHP) a lot/', 'I like pHp a lot', $match);
// $match[1] is 'like pHp'
Inline flags do not enable capturing. You need an additional set of capturing parentheses to do that.
It’s sometimes useful in patterns to be able to say “match here if this is next.” This is particularly common when you are splitting a string. The regular expression describes the separator, which is not returned. You can use lookahead to make sure (without matching it, thus preventing it from being returned) that there’s more data after the separator. Similarly, lookbehind checks the preceding text.
Lookahead and lookbehind come in two forms: positive and negative. A positive lookahead or lookbehind says “the next/preceding text must be like this.” A negative lookahead or lookbehind says “the next/preceding text must not be like this.” Table 4-14 shows the four constructs you can use in Perl-compatible patterns. None of the constructs captures text.
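A minimal sketch of the four forms, using the standard PCRE syntax (?=...), (?!...), (?<=...), and (?<!...). The sample strings and variable names are invented; note that in each case the text inside the assertion is checked but is not part of the match.

```php
<?php
// Positive lookahead: digits that are followed by " dollars".
preg_match('/\d+(?= dollars)/', 'costs 100 dollars', $a);
// $a[0] is '100'

// Negative lookahead: a whole number that is NOT followed by " dollars".
preg_match('/\b\d+\b(?! dollars)/', '100 dollars or 75 euros', $b);
// $b[0] is '75'

// Positive lookbehind: digits that are preceded by a dollar sign.
preg_match('/(?<=\$)\d+/', 'price: $42', $c);
// $c[0] is '42'

// Negative lookbehind: a number that is NOT preceded by a dollar sign.
preg_match('/(?<!\$)\b\d+/', '$42 and 7 items', $d);
// $d[0] is '7'
```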
A simple use of positive lookahead is splitting a Unix mbox mail file into individual messages. The word "From" starting a line by itself indicates the start of a new message, so you can split the mailbox into messages by specifying the separator as the point where the next text is "From" at the start of a line:
$messages = preg_split('/(?=^From )/m', $mailbox);
A simple use of negative lookbehind is to extract quoted strings that contain quoted delimiters. For instance, here’s how to extract a single-quoted string (note that the regular expression is commented using the x modifier):
$input = <<< END
name = 'Tim O\'Reilly';
END;
$pattern = <<< END
' # opening quote
( # begin capturing
.*? # the string
(?<! \\\\ ) # skip escaped quotes
) # end capturing
' # closing quote
END;
preg_match( "($pattern)x", $input, $match);
echo $match[1];
Tim O\'Reilly
The only tricky part is that, to get a pattern that looks behind to see if the last character was a backslash, we need to escape the backslash to prevent the regular expression engine from seeing “\)”, which would mean a literal close parenthesis. In other words, we have to backslash that backslash: “\\)”. But PHP’s string-quoting rules say that \\ produces a literal single backslash, so we end up requiring four backslashes to get one through the regular expression! This is why regular expressions have a reputation for being hard to read.
Perl limits lookbehind to constant-width expressions. That is, the expressions cannot contain quantifiers, and if you use alternation, all the choices must be the same length. The Perl-compatible regular expression engine also forbids quantifiers in lookbehind, but does permit alternatives of different lengths.
The rarely used once-only subpattern, or cut, prevents worst-case behavior by the regular expression engine on some kinds of patterns. Once matched, the subpattern is never backed out of.
The common use for the once-only subpattern is when you have a repeated expression that may itself be repeated:
/(a+|b+)*\.+/
This code snippet takes several seconds to report failure:
$p = '/(a+|b+)*\.+$/';
$s = 'abababababbabbbabbaaaaaabbbbabbababababababbba..!';
if (preg_match($p, $s)) {
  echo "Y";
} else {
  echo "N";
}
This is because the regular expression engine tries all the different places to start the match, but has to backtrack out of each one, which takes time. If you know that once something is matched it should never be backed out of, you should mark it with (?>subpattern):
$p = '/(?>a+|b+)*\.+$/';
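Here the cut doesn’t change the outcome of the match; it simply makes it fail faster. In other patterns a cut can rule out a match that backtracking would otherwise have found, so use it only where the subpattern never needs to give text back.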
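As a quick check on this particular pattern, the sketch below runs the plain and once-only versions against two short subjects and records both results. The subject strings and variable names are invented, and they are deliberately short so that neither version has much backtracking to do.

```php
<?php
$plain  = '/(a+|b+)*\.+$/';
$atomic = '/(?>a+|b+)*\.+$/';

// Record [plain result, once-only result] for each subject.
$results = [];
foreach (['abab..', 'abab..!'] as $s) {
    $results[$s] = [preg_match($plain, $s), preg_match($atomic, $s)];
}

// 'abab..'  ends in periods, so both patterns match:  [1, 1]
// 'abab..!' ends in '!', so both patterns fail:        [0, 0]
print_r($results);
```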
A conditional expression is like an if statement in a regular expression. The general form is:
(?(condition)yespattern)
(?(condition)yespattern|nopattern)
If the assertion succeeds, the regular expression engine matches the yespattern. With the second form, if the assertion doesn’t succeed, the regular expression engine skips the yespattern and tries to match the nopattern.
The assertion can be one of two types: either a backreference, or a lookahead or lookbehind match. To reference a previously matched substring, the assertion is a number from 1-99 (the most backreferences available). The condition succeeds only if that numbered subpattern has already matched. If the assertion is not a backreference, it must be a positive or negative lookahead or lookbehind assertion.
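A small sketch of the backreference form: the pattern below accepts a number that is either bare or wrapped in parentheses, and it requires the closing parenthesis only if the opening one was captured as subpattern 1. The pattern, test strings, and variable names are illustrative assumptions, not taken from the text.

```php
<?php
// (\()?      optionally capture an opening parenthesis as subpattern 1
// \d+        the number itself
// (?(1)\))   if subpattern 1 matched, a closing parenthesis must follow
$pattern = '/^(\()?\d+(?(1)\))$/';

$bare      = preg_match($pattern, '123');    // 1: no parentheses at all
$wrapped   = preg_match($pattern, '(123)');  // 1: balanced parentheses
$openOnly  = preg_match($pattern, '(123');   // 0: closing parenthesis missing
$closeOnly = preg_match($pattern, '123)');   // 0: stray closing parenthesis
```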
There are five classes of functions that work with Perl-compatible regular expressions: matching, replacing, splitting, filtering, and a utility function for quoting text.
The preg_match( ) function performs Perl-style pattern matching on a string. It’s the equivalent of the m// operator in Perl. The preg_match( ) function takes the same arguments and gives the same return value as the ereg( ) function, except that it takes a Perl-style pattern instead of a standard pattern:
$found = preg_match(pattern, string [, captured ]);
For example:
preg_match('/y.*e$/', 'Sylvie'); // returns true
preg_match('/y(.*)e$/', 'Sylvie', $m); // $m is array('Sylvie', 'lvi')
While there’s an eregi( ) function to match case-insensitively, there’s no preg_matchi( ) function. Instead, use the i flag on the pattern:
preg_match('/y.*e$/i', 'SyLvIe'); // returns true
The preg_match_all( ) function repeatedly matches from where the last match ended, until no more matches can be made:
$found = preg_match_all(pattern, string, matches [, order ]);
The order value, either PREG_PATTERN_ORDER or PREG_SET_ORDER, determines the layout of matches. We’ll look at both, using this code as a guide:
$string = <<< END
13 dogs
12 rabbits
8 cows
1 goat
END;
preg_match_all('/(\d+) (\S+)/', $string, $m1, PREG_PATTERN_ORDER);
preg_match_all('/(\d+) (\S+)/', $string, $m2, PREG_SET_ORDER);
With PREG_PATTERN_ORDER (the default), each element of the array corresponds to a particular capturing subpattern. So $m1[0] is an array of all the substrings that matched the pattern, $m1[1] is an array of all the substrings that matched the first subpattern (the numbers), and $m1[2] is an array of all the substrings that matched the second subpattern (the words). The array $m1 has one more element than there are subpatterns.
With PREG_SET_ORDER, each element of the array corresponds to the next attempt to match the whole pattern. So $m2[0] is an array of the first set of matches ('13 dogs', '13', 'dogs'), $m2[1] is an array of the second set of matches ('12 rabbits', '12', 'rabbits'), and so on. The array $m2 has as many elements as there were successful matches of the entire pattern.
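To make the two layouts concrete, this sketch reruns the same calls and prints a few slices of each result. It reuses the sample data from the text; only the echo statements are additions.

```php
<?php
$string = <<<END
13 dogs
12 rabbits
8 cows
1 goat
END;

preg_match_all('/(\d+) (\S+)/', $string, $m1, PREG_PATTERN_ORDER);
preg_match_all('/(\d+) (\S+)/', $string, $m2, PREG_SET_ORDER);

// Pattern order: one array per subpattern.
echo implode(',', $m1[1]), "\n";   // 13,12,8,1
echo implode(',', $m1[2]), "\n";   // dogs,rabbits,cows,goat
echo count($m1), "\n";             // 3 (whole match plus two subpatterns)

// Set order: one array per successful match.
echo implode('|', $m2[1]), "\n";   // 12 rabbits|12|rabbits
echo count($m2), "\n";             // 4 (one per line of input)
```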
Example 4-2 fetches the HTML at a particular web address into a string and extracts the URLs from that HTML. For each URL, it generates a link back to the program that will display the URLs at that address.
Example 4-2. Extracting URLs from an HTML page
<?php
// Accept the URL from either a POST or a GET request
if (getenv('REQUEST_METHOD') == 'POST') {
  $url = isset($_POST['url']) ? $_POST['url'] : '';
} else {
  $url = isset($_GET['url']) ? $_GET['url'] : '';
}
?>
<form action="<?= htmlspecialchars($_SERVER['PHP_SELF']) ?>" method="POST">
URL: <input type="text" name="url" value="<?= htmlspecialchars($url) ?>" /><br>
<input type="submit">
</form>
<?php
if ($url) {
  $remote = fopen($url, 'r');
  $html = fread($remote, 1048576); // read up to 1 MB of HTML
  fclose($remote);

  $urls = '(http|telnet|gopher|file|wais|ftp)';
  $ltrs = '\w';
  $gunk = '/#~:.?+=&%@!\-';
  $punc = '.:?\-';
  $any = "$ltrs$gunk$punc";

  preg_match_all("{
    \b            # start at word boundary
    $urls :       # need resource and a colon
    [$any] +?     # followed by one or more of any valid
                  # characters--but be conservative
                  # and take only what you need
    (?=           # the match ends at
      [$punc]*    # punctuation
      [^$any]     # followed by a non-URL character
      |           # or
      $           # the end of the string
    )
  }x", $html, $matches);
  printf("I found %d URLs<P>\n", sizeof($matches[0]));
  foreach ($matches[0] as $u) {
    $link = $_SERVER['PHP_SELF'] . '?url=' . urlencode($u);
    echo "<A HREF='$link'>$u</A><BR>\n";
  }
}
?>
The preg_replace( ) function behaves like the search and replace operation in your text editor. It finds all occurrences of a pattern in a string and changes those occurrences to something else:
$new = preg_replace(pattern, replacement, subject [, limit ]);
In the most common usage, all the arguments are strings except for the integer limit. The limit is the maximum number of occurrences of the pattern to replace (the default, and the behavior when a limit of -1 is passed, is all occurrences).
$better = preg_replace('/<.*?>/', '!', 'do <b>not</b> press the button'); // $better is 'do !not! press the button'
Pass an array of strings as subject to make the substitution on all of them. The new strings are returned from preg_replace( ):
$names = array('Fred Flintstone', 'Barney Rubble', 'Wilma Flintstone', 'Betty Rubble');
$tidy = preg_replace('/(\w)\w* (\w+)/', '\1 \2', $names);
// $tidy is array ('F Flintstone', 'B Rubble', 'W Flintstone', 'B Rubble')
To perform multiple substitutions on the same string or array of strings with one call to preg_replace( ), pass arrays of patterns and replacements:
$contractions = array("/don't/i", "/won't/i", "/can't/i");
$expansions = array('do not', 'will not', 'can not');
$string = "Please don't yell--I can't jump while you won't speak";
$longer = preg_replace($contractions, $expansions, $string);
// $longer is 'Please do not yell--I can not jump while you will not speak';
If you give fewer replacements than patterns, text matching the extra patterns is deleted. This is a handy way to delete a lot of things at once:
$html_gunk = array('/<.*?>/', '/&.*?;/');
$html = '&eacute; : <b>very</b> cute';
$stripped = preg_replace($html_gunk, array( ), $html);
// $stripped is ' : very cute'
If you give an array of patterns but a single string replacement, the same replacement is used for every pattern:
$stripped = preg_replace($html_gunk, '', $html);
The replacement can use backreferences. Unlike backreferences in patterns, though, the preferred syntax for backreferences in replacements is $1, $2, $3, etc. For example:
echo preg_replace('/(\w)\w+\s+(\w+)/', '$2, $1.', 'Fred Flintstone');
Flintstone, F.
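The /e modifier makes preg_replace( ) treat the replacement string as PHP code that returns the actual string to use in the replacement. (The /e modifier was deprecated in PHP 5.5 and removed in PHP 7; on current versions, use preg_replace_callback( ) instead.) For example, this converts every Celsius temperature to Fahrenheit: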
$string = 'It was 5C outside, 20C inside';
echo preg_replace('/(\d+)C\b/e', '$1*9/5+32', $string);
It was 41 outside, 68 inside
This more complex example expands variables in a string:
$name = 'Fred';
$age = 35;
$string = '$name is $age';
preg_replace('/\$(\w+)/e', '$$1', $string);
Each match isolates the name of a variable ($name, $age). The $1 in the replacement refers to those names, so the PHP code actually executed is $name and $age. That code evaluates to the value of the variable, which is what’s used as the replacement. Whew!
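The /e modifier is not available in PHP 7 and later, so on current versions the same two substitutions can be written with preg_replace_callback( ). The following is a sketch rather than a drop-in translation: the $vars lookup array is an assumption introduced here so that the callback does not need to evaluate arbitrary variable names, and unknown names are left untouched.

```php
<?php
// Celsius to Fahrenheit: the callback receives the match array and
// returns the replacement text.
$string = 'It was 5C outside, 20C inside';
$fahrenheit = preg_replace_callback('/(\d+)C\b/', function ($m) {
    return (string) ($m[1] * 9 / 5 + 32);
}, $string);
echo $fahrenheit, "\n";   // It was 41 outside, 68 inside

// Variable expansion: look names up in an explicit array (hypothetical
// $vars) instead of executing code built from the subject string.
$vars = ['name' => 'Fred', 'age' => 35];
$expanded = preg_replace_callback('/\$(\w+)/', function ($m) use ($vars) {
    return isset($vars[$m[1]]) ? (string) $vars[$m[1]] : $m[0];
}, '$name is $age');
echo $expanded, "\n";     // Fred is 35
```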
Whereas you use preg_match_all( ) to extract chunks of a string when you know what those chunks are, use preg_split( ) to extract chunks when you know what separates the chunks from each other:
$chunks = preg_split(pattern, string [, limit [, flags ]]);
The pattern matches a separator between two chunks. By default, the separators are not returned. The optional limit specifies the maximum number of chunks to return (-1 is the default, which means all chunks). The flags argument is a bitwise OR combination of the flags PREG_SPLIT_NO_EMPTY (empty chunks are not returned) and PREG_SPLIT_DELIM_CAPTURE (parts of the string captured in the pattern are returned).
For example, to extract just the operands from a simple numeric expression, use:
$ops = preg_split('{[+*/-]}', '3+5*9/2'); // $ops is array('3', '5', '9', '2')
To extract the operands and the operators, use:
$ops = preg_split('{([+*/-])}', '3+5*9/2', -1, PREG_SPLIT_DELIM_CAPTURE); // $ops is array('3', '+', '5', '*', '9', '/', '2')
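An empty pattern matches at every boundary between characters in the string, including the start and the end. Combined with PREG_SPLIT_NO_EMPTY to drop the empty first and last chunks, this lets you split a string into an array of characters:
$array = preg_split('//', $string, -1, PREG_SPLIT_NO_EMPTY);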
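The limit and flags arguments are easiest to understand side by side. This is a minimal sketch with invented input strings and variable names.

```php
<?php
// A trailing separator produces an empty final chunk by default...
$withEmpty = preg_split('/[\s,]+/', 'a, b,,c ');
// $withEmpty is array('a', 'b', 'c', '')

// ...which PREG_SPLIT_NO_EMPTY removes.
$noEmpty = preg_split('/[\s,]+/', 'a, b,,c ', -1, PREG_SPLIT_NO_EMPTY);
// $noEmpty is array('a', 'b', 'c')

// A limit of 2 stops splitting after the first separator; the rest of
// the string is returned unsplit as the last chunk.
$limited = preg_split('/,/', 'a,b,c', 2);
// $limited is array('a', 'b,c')
```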
A variation on preg_replace( ) is preg_replace_callback( ). This calls a function to get the replacement string. The function is passed an array of matches (the zeroth element is all the text that matched the pattern, the first is the contents of the first captured subpattern, and so on). For example:
function titlecase ($s) {
return ucfirst(strtolower($s[0]));
}
$string = 'goodbye cruel world';
$new = preg_replace_callback('/\w+/', 'titlecase', $string);
echo $new;
Goodbye Cruel World
The preg_grep( ) function returns those elements of an array that match a given pattern:
$matching = preg_grep(pattern, array);
For instance, to get only the filenames that end in .txt, use:
$textfiles = preg_grep('/\.txt$/', $filenames);
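Two details are worth seeing in a sketch: preg_grep( ) keeps the original array keys, and it accepts an optional third argument, PREG_GREP_INVERT, to return the elements that do not match. The file list and variable names below are invented for illustration.

```php
<?php
$filenames = array('notes.txt', 'index.php', 'todo.txt');

// Matching elements keep their original keys (0 and 2 here).
$textfiles = preg_grep('/\.txt$/', $filenames);
// $textfiles is array(0 => 'notes.txt', 2 => 'todo.txt')

// PREG_GREP_INVERT returns the elements that do NOT match.
$others = preg_grep('/\.txt$/', $filenames, PREG_GREP_INVERT);
// $others is array(1 => 'index.php')

// Use array_values( ) if you need consecutive keys again.
$reindexed = array_values($textfiles);
// $reindexed is array('notes.txt', 'todo.txt')
```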
The preg_quote( ) function creates a regular expression that matches only a given string:
$re = preg_quote(string [, delimiter ]);
Every character in string that has special meaning inside a regular expression (e.g., * or $) is prefaced with a backslash:
echo preg_quote('$5.00 (five bucks)');
\$5\.00 \(five bucks\)
The optional second argument is an extra character to be quoted. Usually, you pass your regular expression delimiter here:
$to_find = '/usr/local/etc/rsync.conf';
$re = preg_quote($to_find, '/');
if (preg_match("/$re/", $filename)) {
  // found it!
}
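The difference quoting makes shows up when the search text contains characters such as . and *. In this sketch the search text and subject strings are invented, and the unquoted pattern is included only for contrast.

```php
<?php
$needle = 'a.b*c';

// Unquoted, the period and asterisk act as regular expression operators,
// so the pattern matches text that does not contain the needle at all.
$loose = preg_match('/' . $needle . '/', 'aXbbbc');    // 1

// Quoted, every special character is matched literally.
$quoted = preg_quote($needle, '/');                    // a\.b\*c
$strictMiss = preg_match("/$quoted/", 'aXbbbc');       // 0
$strictHit  = preg_match("/$quoted/", 'xxa.b*cxx');    // 1
```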
Although very similar, PHP’s implementation of Perl-style regular expressions has a few minor differences from actual Perl regular expressions:
The null character (ASCII 0) is not allowed as a literal character within a pattern string. You can reference it in other ways, however (\000, \x00, etc.).
The \E, \G, \L, \l, \Q, \u, and \U options are not supported.
The (?{ some perl code }) construct is not supported.
The /D, /G, /U, /u, /A, and /X modifiers are supported.
The vertical tab \v counts as a whitespace character.
Lookahead and lookbehind assertions cannot be repeated using *, +, or ?.
Parenthesized submatches within negative assertions are not remembered.
Alternation branches within a lookbehind assertion can be of different lengths.