Robin Nixon shows how you can use regular expressions to perform complex searches with a minimum of code.
This excerpt is from Learning PHP, MySQL, and JavaScript. Discover how the powerful combination of PHP and MySQL provides an easy way to build modern websites complete with dynamic data and user interaction. You'll also learn how to add JavaScript to create rich Internet applications and websites.
Regular
expressions are supported by both JavaScript and PHP, as well as a number of
other languages. They make it possible to construct the most powerful of
pattern-matching algorithms within a single expression.
Every
regular expression must be enclosed in slashes. Within these slashes, certain
characters have special meanings; these are called metacharacters.
For instance, an asterisk (*)
has a meaning similar to what you have seen if you use a shell or Windows
Command prompt (but not quite the same). An asterisk means, "the text you're
trying to match may have any number of the preceding character—or none at
all."
Let's say you're looking for the name "Le Guin" and
know that someone might spell it with or without a space. Because the text is
laid out strangely (for instance, someone may have inserted extra spaces to
right-justify lines), you could have to search for a line such as:
The difficulty of classifying Le Guin's works
So you need to match "LeGuin," as well as "Le" and
"Guin" separated by any number of spaces. The solution is to follow a space
with an asterisk:
/Le *Guin/
There's
a lot more than the name "Le Guin" in the line, but that's OK. As long as the
regular expression matches some part of the line, the test function returns a true value. What
if it's important to make sure the line contains nothing but "Le Guin"? I'll
show how to ensure that later.
Suppose that you know there is always at least one
space. In that case, you could use the plus sign (+), because it requires at least one of the preceding
character to be present:
/Le +Guin/
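To see the difference between the asterisk and the plus sign, you can try both patterns against a few strings. This is only a small sketch: the sample strings and variable names are made up for illustration, and it uses JavaScript's test method to report whether each pattern matches.

```javascript
// Sample strings are invented for illustration.
const zeroOrMore = /Le *Guin/; // any number of spaces, including none
const oneOrMore = /Le +Guin/;  // at least one space is required

const samples = ["LeGuin", "Le Guin", "Le    Guin's works"];

// test() returns true if the pattern matches anywhere in the string.
const starResults = samples.map((s) => zeroOrMore.test(s));
const plusResults = samples.map((s) => oneOrMore.test(s));

console.log(starResults); // [ true, true, true ]
console.log(plusResults); // [ false, true, true ]
```

Only the first sample differs: with no space at all, the asterisk version still matches but the plus version does not.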
The dot (.) is particularly useful, because it can match
anything except a newline. Suppose that you are looking for HTML tags, which
start with "<" and end with ">". A simple way to do so is:
/<.*>/
The dot matches any character and the * expands it to
match zero or more characters, so this is saying, "match anything that lies
between < and >, even if there's nothing." You will match <>,
<em>, <br /> and so on. But if you don't want to match the empty
case, <>, you should use the + sign instead of *, like this:
/<.+>/
The plus sign expands the dot to match one or more
characters, saying, "match anything that lies between < and > as long as
there's at least one character between them." You will match <em> and
</em>, <h1> and </h1>, and tags with attributes such as:
<a href="www.mozilla.org">
Unfortunately, the plus sign keeps on matching up to
the last > on the line, so you might end up with:
<h1><b>Introduction</b></h1>
A lot
more than one tag! I'll show a better solution later on.
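You can watch this greedy behavior happen in JavaScript. The following sketch (variable names are my own) uses the string method match, which returns the matched text as the first element of its result:

```javascript
const line = "<h1><b>Introduction</b></h1>";

// .+ grabs as much as it can, so the match runs to the last > on the line.
const greedyMatch = line.match(/<.+>/)[0];
console.log(greedyMatch); // <h1><b>Introduction</b></h1>

// The empty case <> is accepted by * but rejected by +.
const emptyWithStar = /<.*>/.test("<>");
const emptyWithPlus = /<.+>/.test("<>");
console.log(emptyWithStar, emptyWithPlus); // true false
```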
If you want to match the dot character itself (.), you have to
escape it by placing a backslash (\) before it, because otherwise it's a metacharacter
and matches anything. As an example, suppose you want to match the
floating-point number "5.0". The regular expression is:
/5\.0/
The
backslash can escape any metacharacter, including another backslash (in case
you're trying to match a backslash in text). However, to make things a bit
confusing, you'll see later how backslashes sometimes give the following
character a special meaning.
We just matched a floating-point number. But perhaps
you want to match "5." as well as "5.0", because both mean the same thing as a
floating-point number. You also want to match "5.00", "5.000", and so
forth—any number of zeros is allowed. You can do this by adding an
asterisk, as you've seen:
/5\.0*/
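A quick way to convince yourself that the backslash matters is to compare the escaped and unescaped patterns. This is an illustrative sketch with made-up variable names and test strings:

```javascript
const unescaped = /5.0/;   // the dot matches any character
const escaped = /5\.0/;    // the dot must be a literal period
const anyZeros = /5\.0*/;  // a literal period followed by zero or more zeros

const unescapedHits500 = unescaped.test("500"); // true: the dot matched "0"
const escapedHits500 = escaped.test("500");     // false: no period present
const matchesBareDot = anyZeros.test("5.");     // true: zero zeros is allowed
const longMatch = "5.000".match(anyZeros)[0];   // "5.000"

console.log(unescapedHits500, escapedHits500, matchesBareDot, longMatch);
```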
Suppose you want to match numbers that grow by powers of a thousand,
corresponding to unit prefixes such as kilo, mega, giga, and tera. In other
words, you want all the following to match:
1,000
1,000,000
1,000,000,000
1,000,000,000,000
...
The plus sign works here, too, but you need to group
the string ",000" so the plus sign matches the whole thing. The regular
expression is:
/1(,000)+ /
The
parentheses mean "treat this as a group when you apply something such as a plus
sign." 1,00,000 and 1,000,00 won't match because the text must have a 1
followed by one or more complete groups of a comma followed by three zeroes.
The
space after the +
character indicates that the match must end when a space is encountered.
Without it 1,000,00 would incorrectly match because only the first 1,000 would
be taken into account, and the remaining ,00 would be ignored. Requiring a
space afterwards ensures matching will continue right through to the end of a
number.
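The effect of the group and of the trailing space can be checked with a short sketch like the following. The input strings (each ending in a space) and the variable names are assumptions chosen for the demonstration:

```javascript
const withSpace = /1(,000)+ /;   // the group must be followed by a space
const withoutSpace = /1(,000)+/; // the same pattern without the trailing space

// Each input ends with a space, as it would inside running text.
const inputs = ["1,000 ", "1,000,000 ", "1,00,000 ", "1,000,00 "];
const strictResults = inputs.map((s) => withSpace.test(s));
console.log(strictResults); // [ true, true, false, false ]

// Without the space, the malformed number slips through,
// because the leading "1,000" is enough to satisfy the pattern.
const looseResult = withoutSpace.test("1,000,00 ");
console.log(looseResult); // true
```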
Sometimes
you want to match something fuzzy, but not so broad that you want to use a dot.
Fuzziness is the great strength of regular expressions: they allow you to be as
precise or vague as you want.
One of the key features supporting fuzzy matching is
the pair of square brackets, []. It matches a single character, like a dot, but
inside the brackets you put a list of things that can match. If any of those
characters appears, the text matches. For instance, if you wanted to match both
the American spelling "gray" and the British spelling "grey," you could
specify:
/gr[ae]y/
After
the gr in the
text you're matching, there can be either an a or an e. But there must be only one of them:
whatever you put inside the brackets matches exactly one character. The group
of characters inside the brackets is called a character
class.
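Here is a small sketch of that character class in action. The word list is invented for illustration; note how strings with both letters, or with neither, fail:

```javascript
const grayOrGrey = /gr[ae]y/;

const words = ["gray", "grey", "graey", "gry"];
const classResults = words.map((w) => grayOrGrey.test(w));

// The brackets match exactly one character: a or e, not both and not none.
console.log(classResults); // [ true, true, false, false ]
```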
Inside the brackets, you can use a hyphen (-) to indicate a
range. One very common task is matching a single digit, which you can do with a
range as follows:
/[0-9]/
Digits are such a common item in regular expressions
that a single character is provided to represent them: \d. You can use
it in the place of the bracketed regular expression to match a digit:
/\d/
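The two forms behave identically, as this brief sketch suggests. The sample sentence and variable names are made up for the example:

```javascript
const sentence = "Apollo 11 landed in 1969";

// Both patterns find the same first digit.
const firstByRange = sentence.match(/[0-9]/)[0];
const firstByShorthand = sentence.match(/\d/)[0];
console.log(firstByRange, firstByShorthand); // 1 1

// Neither matches when there are no digits at all.
const rangeOnText = /[0-9]/.test("no digits here");
const shorthandOnText = /\d/.test("no digits here");
console.log(rangeOnText, shorthandOnText); // false false
```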
One other important feature of the square brackets is negation of a character class. You can turn the whole
character class on its head by placing a caret (^) after the opening bracket. Here it means, "Match any
characters except the
following." So let's say you want to find instances of "Yahoo" that lack the
following exclamation point. (The name of the company officially contains an
exclamation point!) You could do it as follows:
/Yahoo[^!]/
The
character class consists of a single character—an exclamation
point—but it is inverted by the preceding ^. This is actually not a great solution
to the problem—for instance, it fails if "Yahoo" is at the end of the
line, because then it's not followed by anything,
whereas the brackets must match a character. A better solution involves
negative look-ahead (matching something only when it is not followed by a specified pattern),
but that's beyond the scope of this book.
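For the curious, the following sketch shows both the weakness of the negated class and what a negative look-ahead version might look like. The (?!...) syntax is the standard look-ahead notation in JavaScript (and in PHP's PCRE functions); the test strings and variable names are assumptions for illustration only:

```javascript
const negatedClass = /Yahoo[^!]/; // needs some character other than ! after Yahoo
const lookahead = /Yahoo(?!!)/;   // Yahoo, as long as the next character is not !

const cases = ["Yahoo! Mail", "Yahoo Mail", "I searched on Yahoo"];

const classResults = cases.map((s) => negatedClass.test(s));
const lookaheadResults = cases.map((s) => lookahead.test(s));

// The last string ends with "Yahoo", so the negated class has nothing to match.
console.log(classResults);     // [ false, true, false ]
console.log(lookaheadResults); // [ false, true, true ]
```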
With an understanding of character classes and
negation, you're ready now to see a better solution to the problem of matching
an HTML tag. This solution avoids going past the end of a single tag, but still
matches tags such as <em> and </em> as well as tags with attributes
such as:
<a href="www.mozilla.org">
One solution is:
/<[^>]+>/
That
regular expression may look like I dropped my teacup on the keyboard, but it is
perfectly valid and very useful. Let's break it apart. The elements are:
/ Opening slash that indicates this is a regular expression.
< Opening bracket of an HTML tag. This is matched exactly; it is not a metacharacter.
[^>] Character class. The embedded ^> means "match anything except a closing angle bracket."
+ Allows any number of characters to match the previous [^>], as long as there is at least one of them.
> Closing bracket of an HTML tag. This is matched exactly.
/ Closing slash that indicates the end of the regular expression.
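To check that this expression really stops at the end of each tag, you could run it over the line that caused trouble earlier. This sketch adds the g modifier so that match returns every tag rather than only the first; the variable names are my own:

```javascript
const tagPattern = /<[^>]+>/g; // g: collect every match, not just the first

const line = "<h1><b>Introduction</b></h1>";
const tags = line.match(tagPattern);
console.log(tags); // [ '<h1>', '<b>', '</b>', '</h1>' ]

// A tag with attributes is still matched as a single unit.
const anchor = '<a href="www.mozilla.org">';
const anchorMatch = anchor.match(/<[^>]+>/)[0];
console.log(anchorMatch === anchor); // true
```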
We are going to look now at a commonly used regular
expression:
/[^a-zA-Z0-9_]/
It matches any single character that is not a letter, a digit, or an underscore.
There
are two other important metacharacters. They "anchor" a regular expression by
requiring that it appear in a particular place. If a caret (^) appears at the
beginning of the regular expression, the expression has to appear at the
beginning of a line of text; otherwise, it doesn't match. Similarly, if a
dollar sign ($)
appears at the end of the regular expression, the expression has to appear at
the end of a line of text.
We'll finish our exploration of regular expression
basics by answering a question raised earlier: suppose you want to make sure
there is nothing extra on a line besides the regular expression? What if you
want a line that has "Le Guin" and nothing else? We can do that by amending the
earlier regular expression to anchor the two ends:
/^Le *Guin$/
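The following sketch compares the anchored expression with the earlier unanchored one. The test strings and variable names are chosen only for illustration:

```javascript
const unanchored = /Le *Guin/;
const anchored = /^Le *Guin$/;

const candidates = [
  "Le Guin",
  "LeGuin",
  "The difficulty of classifying Le Guin's works",
];

const unanchoredResults = candidates.map((s) => unanchored.test(s));
const anchoredResults = candidates.map((s) => anchored.test(s));

// With ^ and $, extra text before or after the name prevents a match.
console.log(unanchoredResults); // [ true, true, true ]
console.log(anchoredResults);   // [ true, true, false ]
```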
The following table shows the metacharacters available in regular expressions.
| Metacharacters | Description |
| / | Begins and ends the regular expression |
| . | Matches any single character except the newline |
| element* | Matches element zero or more times |
| element+ | Matches element one or more times |
| element? | Matches element zero or one times |
| [characters] | Matches a character out of those contained within the brackets |
| [^characters] | Matches a single character that is not contained within the brackets |
| (regex) | Treats the regex as a group for counting or a following *, +, or ? |
| left|right | Matches either left or right |
| l-r | Matches a range of characters between l and r (only within brackets) |
| ^ | Requires match to be at the string's start |
| $ | Requires match to be at the string's end |
| \b | Matches a word boundary |
| \B | Matches where there is not a word boundary |
| \d | Matches a single digit |
| \D | Matches a single non-digit |
| \n | Matches a newline character |
| \s | Matches a whitespace character |
| \S | Matches a non-whitespace character |
| \t | Matches a tab character |
| \w | Matches a word character (a-z, A-Z, 0-9, and _) |
| \W | Matches a non-word character (anything but a-z, A-Z, 0-9, and _) |
| \x | x (useful if x is a metacharacter, but you really want x) |
| {n} | Matches exactly n times |
| {n,} | Matches n times or more |
| {min,max} | Matches at least min and at most max times |
Provided
with this table, and looking again at the expression /[^a-zA-Z0-9_]/, you can see that it
could easily be shortened to /[^\w]/
because the single metacharacter \w
(with a lower case w)
specifies the characters a-z,
A-Z, 0-9, and _.
In
fact, we can be cleverer than that, because the metacharacter \W (with an upper
case W)
specifies all characters except
for a-z, A-Z, 0-9, and _. Therefore we
could also drop the ^
metacharacter and simply use /[\W]/
for the expression.
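If you would like to confirm that the three forms are interchangeable, one rough approach is to compare them on every ASCII character. This is only a sketch with made-up variable names, and it deliberately limits itself to character codes 0 through 127:

```javascript
const longForm = /[^a-zA-Z0-9_]/;
const negatedLowerW = /[^\w]/;
const upperW = /\W/;

// Check every ASCII character and record whether all three patterns agree.
let allAgree = true;
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  const expected = longForm.test(ch);
  if (negatedLowerW.test(ch) !== expected || upperW.test(ch) !== expected) {
    allAgree = false;
  }
}

console.log(allAgree); // true
```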
To give
you more ideas of how this all works the following table shows a range of
expressions and the patterns they match.
| Example | Matches |
| r | The first r in The quick brown |
| rec[ei][ei]ve | Either of receive or recieve (but also receeve or reciive) |
| rec[ei]{2}ve | Either of receive or recieve (but also receeve or reciive) |
| rec(ei|ie)ve | Either of receive or recieve (but not receeve or reciive) |
| cat | The word cat in I like cats and dogs |
| cat|dog | Either of the words cat or dog in I like cats and dogs |
| \. | . (the \ is necessary because . is a metacharacter) |
| 5\.0* | 5., 5.0, 5.00, 5.000, etc. |
| [a-f] | Any of the characters a, b, c, d, e, or f |
| cats$ | Only the final cats in My cats are friendly cats |
| ^my | Only the first my in my cats are my pets |
| \d{2,3} | Any two- or three-digit number (00 through 999) |
| 7(,000)+ | 7,000; 7,000,000; 7,000,000,000; 7,000,000,000,000; etc. |
| [\w]+ | Any word of one or more characters |
| [\w]{5} | Any five-letter word |
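Where the parentheses go matters a great deal when you use the | metacharacter, because alternation applies to everything on each side of it unless a group limits its reach. The sketch below, with invented variable names and word lists, compares a character class repeated twice, a properly grouped alternation, and an ungrouped one:

```javascript
const classTwice = /rec[ei]{2}ve/;   // any two characters drawn from e and i
const grouped = /rec(ei|ie)ve/;      // exactly "ei" or "ie" between rec and ve
const ungrouped = /rec(ei)|(ie)ve/;  // means "recei" OR "ieve"

const words = ["receive", "recieve", "receeve", "reciive"];
const classResults = words.map((w) => classTwice.test(w));
const groupedResults = words.map((w) => grouped.test(w));
console.log(classResults);   // [ true, true, true, true ]
console.log(groupedResults); // [ true, true, false, false ]

// The ungrouped form accepts an unrelated word that merely contains "ieve".
const ungroupedOnBelieve = ungrouped.test("believe");
const groupedOnBelieve = grouped.test("believe");
console.log(ungroupedOnBelieve, groupedOnBelieve); // true false
```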
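Some additional modifiers are available for regular expressions:
/g enables global matching, so that all matches are found rather than only the first one.
/i makes the regular expression case-insensitive.
/m enables multiline mode, in which the caret (^) and dollar ($) match at the start and end of each line in the subject string.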
For example, the
expression /cats/g will match both occurrences of the word cats in the sentence "I like cats and cats like me".
Similarly /dogs/gi will match both occurrences of the word dogs (Dogs
and dogs) in the sentence "Dogs like other dogs", because you can use these
specifiers together.
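These two sentences make a handy test case. In the sketch below (variable names are my own), the string method match returns an array of every match when the g modifier is present:

```javascript
const catMatches = "I like cats and cats like me".match(/cats/g);
console.log(catMatches); // [ 'cats', 'cats' ]

// With g and i together, both capitalizations are found.
const dogMatches = "Dogs like other dogs".match(/dogs/gi);
console.log(dogMatches); // [ 'Dogs', 'dogs' ]

// Drop the i and the capitalized "Dogs" is no longer matched.
const caseSensitiveMatches = "Dogs like other dogs".match(/dogs/g);
console.log(caseSensitiveMatches); // [ 'dogs' ]
```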
In
JavaScript you will use regular expressions mostly in two methods: test (which you have
already seen) and replace.
Whereas test
just tells you whether its argument matches the regular expression, replace takes a
second parameter: the string to replace the text that matches. Like most
functions, replace
generates a new string as a return value; it does not change the input.
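As a rough sketch of how the two methods differ, the following uses console.log rather than document.write so that it can also run outside a browser. The variable names are assumptions for illustration:

```javascript
const original = "Cats are fun. I like cats.";

// test() only reports whether there is a match.
const found = /cats/i.test(original);
console.log(found); // true

// Without g, replace() changes only the first match.
const firstOnly = original.replace(/cats/i, "dogs");
console.log(firstOnly); // dogs are fun. I like cats.

// With g, every match is replaced.
const everyMatch = original.replace(/cats/gi, "dogs");
console.log(everyMatch); // dogs are fun. I like dogs.

// The input string itself is never modified.
console.log(original); // Cats are fun. I like cats.
```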
To compare the two methods, the following statement
just returns true to let us know that the word "cats" appears at least
once somewhere within the string:
document.write(/cats/i.test("Cats are fun. I like cats."))
But the following statement replaces both occurrences
of the word cats with the word
dogs, printing the
result. The search has to be global (/g) to find all occurrences, and case-insensitive (/i) to find the capitalized "Cats":
document.write("Cats are fun. I like cats.".replace(/cats/gi,"dogs"))
If you
try out the statement, you'll see a limitation of replace: because it replaces text with
exactly the string you tell it to use, the first word "Cats" is replaced by
"dogs" instead of "Dogs".
The
most common regular expression functions that you are likely to use in PHP are preg_match, preg_match_all, and preg_replace.
To test whether the word cats appears anywhere within a string, in any combination
of upper- and lowercase, you could use preg_match like this:
$n = preg_match("/cats/i", "Cats are fun. I like cats.");
Because PHP uses 1 for true and
0 for false, the preceding statement sets $n to 1. The
first argument is the regular expression and the second is the text to match.
But preg_match is
actually a good deal more powerful and complicated, because it takes a third
argument that shows what text matched:
$n = preg_match("/cats/i", "Cats are fun. I like cats.", $match);
echo "$n Matches: $match[0]";
The third argument is an array (here given the name $match). The
function puts the text that matches into the first element, so if the match is
successful you can find the text that matched in $match[0]. In this example, the output lets us know that the
matched text was capitalized:
1 Matches: Cats
If you wish to locate all matches, you use the preg_match_all
function, like this:
$n = preg_match_all("/cats/i", "Cats are fun. I like cats.", $match);
echo "$n Matches: ";
for ($j=0 ; $j < $n ; ++$j) echo $match[0][$j]." ";
As
before, $match
is passed to the function and the element $match[0] is assigned the matches made,
but this time as a sub-array. To display the sub-array, this example iterates
through it with a for
loop.
When you want to replace part of a string, you can use
preg_replace as
shown here. This example replaces all occurrences of the word cats with the word dogs, regardless of case:
echo preg_replace("/cats/i", "dogs", "Cats are fun. I like cats.");
About the Author
Robin
Nixon is a technical author, specializing in web development, who has written
three books. This article is reproduced from his latest book, Learning PHP,
MySQL, and JavaScript, published by O'Reilly, ISBN 0596157134.
Copyright © 2009 O'Reilly Media, Inc.