Regular expressions are specially encoded text strings used as patterns for matching sets of strings. They began to emerge in the 1940s as a way to describe regular languages, but they really began to show up in the programming world during the 1970s. The first place I could find them showing up was in the QED text editor written by Ken Thompson.
“A regular expression is a pattern which specifies a set of strings of characters; it is said to match certain strings.” —Ken Thompson
Regular expressions later became an important part of the tool suite that emerged from the Unix operating system—the ed, sed and vi (vim) editors, grep, AWK, among others. But the ways in which regular expressions were implemented were not always so regular.
Note
This book takes an inductive approach; in other words, it moves from the specific to the general. So rather than an example after a treatise, you will often get the example first and then a short treatise following that. It’s a learn-by-doing book.
Regular expressions have a reputation for being gnarly, but that all depends on how you approach them. There is a natural progression from something as simple as this:
\d
a character shorthand that matches any digit from 0 to 9, to something a bit more complicated, like:
^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$
which is where we’ll wind up at the end of this chapter: a fairly robust regular expression that matches a 10-digit, North American telephone number, with or without parentheses around the area code, or with or without hyphens or dots (periods) to separate the numbers. (The parentheses must be balanced, too; in other words, you can’t just have one.)
Note
Chapter 10 shows you a slightly more sophisticated regular expression for a phone number, but the one above is sufficient for the purposes of this chapter.
If you don’t get how that all works yet, don’t worry: I’ll explain the whole expression a little at a time in this chapter. If you will just follow the examples (and those throughout the book, for that matter), writing regular expressions will soon become second nature to you. Ready to find out for yourself?
I at times represent Unicode characters in this book using their code points, written as hexadecimal (base 16) numbers of at least four digits. These code points are shown in the form U+0000. U+002E, for example, represents the code point for a full stop or period (.).
First let me introduce you to the Regexpal website at http://www.regexpal.com. Open the site up in a browser, such as Google Chrome or Mozilla Firefox. You can see what the site looks like in Figure 1-1.
You can see that there is a text area near the top, and a larger text area below that. The top text box is for entering regular expressions, and the bottom one holds the subject or target text. The target text is the text or set of strings that you want to match.
Note
At the end of this chapter and each following chapter, you’ll find a “Technical Notes” section. These notes provide additional information about the technology discussed in the chapter and tell you where to get more information about that technology. Placing these notes at the end of the chapters helps keep the flow of the main text moving forward rather than stopping to discuss each detail along the way.
Now we’ll match a North American phone number with a regular expression. Type the phone number shown here into the lower section of Regexpal:
707-827-7019
Do you recognize it? It’s the number for O’Reilly Media.
Let’s match that number with a regular expression. There are lots of ways to do this, but to start out, simply enter the number itself in the upper section, exactly as it is written in the lower section (hold on now, don’t sigh):
707-827-7019
What you should see is the phone number you entered in the lower box highlighted from beginning to end in yellow. If that is what you see (as shown in Figure 1-2), then you are in business.
Note
When I mention colors in this book, in relation to something you might see in an image or a screenshot, such as the highlighting in Regexpal, those colors may appear online and in e-book versions of this book, but, alas, not in print. So if you are reading this book on paper, then when I mention a color, your world will be grayscale, with my apologies.
What you have done in this regular expression is use something called a string literal to match a string in the target text. In a string literal, each character stands for itself, so the pattern matches only text that contains exactly those characters in that order.
Now delete the number in the upper box and replace it with just the number 7. Did you see what happened? Now only the sevens are highlighted. The literal character (number) 7 in the regular expression matches the four instances of the number 7 in the text you are matching.
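If you would rather try this outside the browser, here is a minimal sketch using Python's built-in re module. Python is my assumption here, not something Regexpal requires, and the variable names are only for illustration.

```python
import re

target = "707-827-7019"

# A string literal: each character in the pattern stands for itself.
whole = re.search(r"707-827-7019", target)

# The single literal 7 matches every 7 in the target, one at a time.
sevens = re.findall(r"7", target)

print(whole.group())
print(len(sevens))
```

The findall call plays the role of Regexpal's highlighting: it returns each separate match it finds in the target text.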
What if you wanted to match all the numbers in the phone number, all at once? Or match any number for that matter?
Try the following, exactly as shown, once again in the upper text box:
[0-9]
All the numbers (more precisely digits) in the lower section are highlighted, in alternating yellow and blue. What the regular expression [0-9] is saying to the regex processor is, “Match any digit you find in the range 0 through 9.”
The square brackets are not literally matched because they are treated specially as metacharacters. A metacharacter has special meaning in regular expressions and is reserved. A regular expression in the form [0-9] is called a character class, or sometimes a character set.
You can limit the range of digits more precisely and get the same result using a more specific list of digits to match, such as the following:
[012789]
This will match only those digits listed, that is, 0, 1, 2, 7, 8, and 9. Try it in the upper box. Once again, every digit in the lower box will be highlighted in alternating colors.
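The same two character classes can be sketched in Python's re module (an assumption on my part; any regex engine would do). The third class, [07], is a hypothetical extra that I added to show what happens when the list leaves some digits out.

```python
import re

target = "707-827-7019"

# A range: any digit from 0 through 9.
any_digit = re.findall(r"[0-9]", target)

# An explicit list: only 0, 1, 2, 7, 8, and 9.
listed = re.findall(r"[012789]", target)

# A narrower list (hypothetical): only 0 and 7.
narrow = re.findall(r"[07]", target)

print(any_digit)
print(listed)
print(narrow)
```

The first two lists come out identical because every digit in this particular phone number happens to be in the listed set. The narrower class skips the 8, 2, 1, and 9.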
To match any 10-digit, North American phone number, whose parts are separated by hyphens, you could do the following:
[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
This will work, but it’s verbose. There is a better way with something called a shorthand.
Yet another way to match digits, which you saw at the beginning of the chapter, is with \d which, by itself, will match all Arabic digits, just like [0-9]. Try that in the top section and, as with the previous regular expressions, the digits below will be highlighted. This kind of regular expression is called a character shorthand. (It is also called a character escape, but this term can be a little misleading, so I avoid it. I’ll explain later.)
To match any digit in the phone number, you could also do this:
\d\d\d-\d\d\d-\d\d\d\d
Repeating the \d three and four times in sequence will exactly match three and four digits in sequence. The hyphen in the above regular expression is entered as a literal character and will be matched as such.
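To compare the character class version with the shorthand version, here is a small sketch in Python (my choice of language, not the chapter's). I use fullmatch, which requires the pattern to cover the entire string. Note that in Python 3, \d applied to ordinary text can also match digits from other scripts unless you pass the re.ASCII flag.

```python
import re

target = "707-827-7019"

# The character class version and the shorthand version of the same pattern.
long_form = re.compile(r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
short_form = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")

print(bool(long_form.fullmatch(target)))
print(bool(short_form.fullmatch(target)))

# One digit short, so the last group of four cannot be satisfied.
print(bool(short_form.fullmatch("707-827-701")))
```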
What about those hyphens? How do you match them? You can use a literal hyphen (-) as already shown, or you could use an escaped uppercase D (\D), which matches any character that is not a digit.
This sample uses \D in place of the literal hyphen.
\d\d\d\D\d\d\d\D\d\d\d\d
Once again, the entire phone number, including the hyphens, should be highlighted.
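Because \D accepts any non-digit, it is more permissive than a literal hyphen. A quick sketch in Python (an assumed language; the candidate strings are made up for illustration) shows what that means in practice:

```python
import re

pattern = re.compile(r"\d\d\d\D\d\d\d\D\d\d\d\d")

candidates = ["707-827-7019", "707 827 7019", "707x827x7019", "7078277019"]

# fullmatch requires the pattern to cover the whole candidate string.
results = {text: bool(pattern.fullmatch(text)) for text in candidates}

for text, matched in results.items():
    print(text, matched)
```

The first three candidates match, since a hyphen, a space, and the letter x are all non-digits. The last one fails because \D must consume exactly one non-digit character and there is none.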
You could also match those pesky hyphens with a dot (.):
\d\d\d.\d\d\d.\d\d\d\d
The dot or period essentially acts as a wildcard and will match any character (except, in certain situations, a line ending). In the example above, the regular expression matches the hyphen, but it could also match a percent sign (%):
707%827%7019
Or a vertical bar (|):
707|827|7019
Note
As I mentioned, the dot character (officially, the full stop) will not normally match a new line character, such as a line feed (U+000A). However, there are ways to make it possible to match a newline with a dot, which I will show you later. This is often called the dotall option.
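Here is a rough sketch of the dot's behavior in Python's re module, which I am assuming for illustration. In that module the dotall option is spelled re.DOTALL; other tools name or enable it differently.

```python
import re

dot_pattern = r"\d\d\d.\d\d\d.\d\d\d\d"

# The dot accepts a percent sign or a vertical bar just as happily as a hyphen.
percent = re.fullmatch(dot_pattern, "707%827%7019")
bar = re.fullmatch(dot_pattern, "707|827|7019")

# By default, the dot does not match a line feed (U+000A)...
plain = re.fullmatch(dot_pattern, "707\n827\n7019")

# ...but with the DOTALL flag it does.
dotall = re.fullmatch(dot_pattern, "707\n827\n7019", flags=re.DOTALL)

print(bool(percent), bool(bar), bool(plain), bool(dotall))
```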
You’ll now match just a portion of the phone number using what is known as a capturing group. Then you’ll refer to the content of the group with a backreference. To create a capturing group, enclose a \d in a pair of parentheses to place it in a group, and then follow it with a \1 to backreference what was captured:
(\d)\d\1
The \1 refers back to what was captured in the group enclosed by parentheses. As a result, this regular expression matches the area code 707. Here is a breakdown of it:
(\d)
matches the first digit and captures it (the number 7)
\d
matches the next digit (the number 0) but does not capture it because it is not enclosed in parentheses
\1
references the captured digit (the number 7)
This will match only the area code. Don’t worry if you don’t fully understand this right now. You’ll see plenty of examples of groups later in the book.
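If it helps to see the captured value directly, here is a minimal sketch in Python (my assumption; the names are illustrative). The match object exposes both the whole match and the content of group 1.

```python
import re

target = "707-827-7019"

# (\d) captures a digit, \d matches any digit, \1 must repeat the captured digit.
match = re.search(r"(\d)\d\1", target)

print(match.group())    # the whole match
print(match.group(1))   # what the capturing group held

# Every place in the target where the pattern matches.
all_matches = [m.group() for m in re.finditer(r"(\d)\d\1", target)]
print(all_matches)
```

Neither 827 nor any run inside 7019 begins and ends with the same digit, which is why the area code is the only match.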
You could now match the whole phone number with one group and several backreferences:
(\d)0\1\D\d\d\1\D\1\d\d\d
But that’s not quite as elegant as it could be. Let’s try something that works even better.
Here is yet another way to match a phone number using a different syntax:
\d{3}-?\d{3}-?\d{4}
The numbers in the curly braces tell the regex processor exactly how many occurrences of those digits you want it to look for. The braces with numbers are a kind of quantifier. The braces themselves are considered metacharacters.
The question mark (?) is another kind of quantifier. It follows the hyphen in the regular expression above and means that the hyphen is optional, that is, that there can be zero or one occurrence of the hyphen (one or none). There are other quantifiers such as the plus sign (+), which means “one or more,” or the asterisk (*) which means “zero or more.”
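The quantifiers are easy to try out in a few lines of Python (an assumed language; the candidate strings are invented for the purpose). The last two lines contrast the plus sign and the asterisk on an empty string.

```python
import re

phone = re.compile(r"\d{3}-?\d{3}-?\d{4}")

candidates = [
    "707-827-7019",   # both hyphens present
    "7078277019",     # both hyphens omitted
    "707-8277019",    # one hyphen omitted
    "707--827-7019",  # two hyphens in a row: more than "zero or one"
    "707-827-701",    # only three digits at the end
]
results = {text: bool(phone.fullmatch(text)) for text in candidates}

for text, matched in results.items():
    print(text, matched)

# + needs at least one digit; * is satisfied by none at all.
one_or_more = re.fullmatch(r"\d+", "")
zero_or_more = re.fullmatch(r"\d*", "")
print(one_or_more is None, zero_or_more is not None)
```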
Using quantifiers, you can make a regular expression even more concise:
(\d{3,4}[.-]?)+
The plus sign again means that the quantity can occur one or more times. This regular expression will match either three or four digits, followed by an optional hyphen or dot, grouped together by parentheses, one or more times (+).
Is your head spinning? I hope not. Here’s a character-by-character analysis of the regular expression above:
(
open a capturing group
\
start character shorthand (escape the following character)
d
end character shorthand (match any digit in the range 0 through 9 with \d)
{
open quantifier
3
minimum quantity to match
,
separate quantities
4
maximum quantity to match
}
close quantifier
[
open character class
.
dot or period (matches literal dot)
-
literal character to match hyphen
]
close character class
?
zero or one quantifier
)
close capturing group
+
one or more quantifier
This all works, but it’s not quite right because it will also match other groups of 3 or 4 digits, whether in the form of a phone number or not. Yes, we learn from our mistakes better than our successes.
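To see the problem concretely, here is a small sketch in Python (an assumption, as are the two non-phone strings I made up). The pattern accepts the phone number, but it accepts plenty of other things, too.

```python
import re

loose = re.compile(r"(\d{3,4}[.-]?)+")

candidates = [
    "707-827-7019",           # a real phone number
    "1234",                   # just four digits
    "123-4567-890-1234-567",  # digit groups, but nothing like a phone number
]
results = {text: bool(loose.fullmatch(text)) for text in candidates}

for text, matched in results.items():
    print(text, matched)
```

All three candidates match in full, because nothing in the pattern says how many groups there should be or how long each one must be.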
So let’s improve it a little:
(\d{3}[.-]?){2}\d{4}
This will match two nonparenthesized sequences of three digits each, each followed by an optional hyphen or dot, and then exactly four digits.
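A quick check of the tightened pattern, sketched in Python (my assumption). I use fullmatch so that the pattern has to account for the whole string; a plain search would still find a phone-shaped run inside a longer string of digits.

```python
import re

tight = re.compile(r"(\d{3}[.-]?){2}\d{4}")

candidates = [
    "707-827-7019",
    "707.827.7019",
    "7078277019",
    "1234",            # too short
    "707-827-70199",   # one digit too many
]
results = {text: bool(tight.fullmatch(text)) for text in candidates}

for text, matched in results.items():
    print(text, matched)
```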
Finally, here is a regular expression that allows literal parentheses to optionally wrap the first sequence of three digits, and makes the area code optional as well:
^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$
To ensure that it is easy to decipher, I’ll look at this one character by character, too:
^
(caret) at the beginning of the regular expression, or following the vertical bar (|), means that the phone number will be at the beginning of a line.
(
opens a capturing group.
\(
matches a literal open parenthesis.
\d
matches a digit.
{3}
is a quantifier that, following \d, matches exactly three digits.
\)
matches a literal close parenthesis.
|
(the vertical bar) indicates alternation, that is, a given choice of alternatives. In other words, this says “match an area code with parentheses or without them.”
^
matches the beginning of a line.
\d
matches a digit.
{3}
matches exactly three digits.
[.-]?
matches an optional dot or hyphen.
)
closes the capturing group.
?
makes the group optional, that is, the area code in the group is not required.
\d
matches a digit.
{3}
matches exactly three digits.
[.-]?
matches an optional dot or hyphen.
\d
matches a digit.
{4}
matches exactly four digits.
$
matches the end of a line.
This final regular expression matches a 10-digit, North American telephone number, with or without parentheses, hyphens, or dots. Because the area code is optional, it also matches a 7-digit number. Try different forms of the number to see what will match (and what won’t).
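One way to try different forms in bulk is a short Python script (Python and the candidate list are my assumptions, not part of Regexpal). Each candidate is on its own, so the ^ and $ anchors line up with the start and end of the string.

```python
import re

phone = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")

candidates = [
    "707-827-7019",
    "707.827.7019",
    "7078277019",
    "(707)827-7019",
    "827-7019",        # area code left out
    "(707-827-7019",   # unbalanced: open parenthesis only
    "707)827-7019",    # unbalanced: close parenthesis only
    "(707) 827-7019",  # a space is not among the allowed separators
]
results = {text: bool(phone.search(text)) for text in candidates}

for text, matched in results.items():
    print(text, matched)
```

The first five forms match and the last three do not. The space-separated form is a reminder that the pattern only allows what you told it to allow.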
Note
The capturing group in the above regular expression is not necessary. The group is necessary, but the capturing part is not. There is a better way to do this: a non-capturing group. When we revisit this regular expression in the last chapter of the book, you’ll understand why.
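As a small preview, and only as a sketch, here is the difference in Python (an assumed language). In Python, as in Perl-style engines generally, a non-capturing group is written by placing ?: right after the opening parenthesis.

```python
import re

capturing = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")
non_capturing = re.compile(r"^(?:\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")

target = "707-827-7019"

# Both patterns match the same text...
print(bool(capturing.search(target)), bool(non_capturing.search(target)))

# ...but only the first one stores the area code for later use.
print(capturing.groups, non_capturing.groups)
print(capturing.search(target).group(1))
```

The groups attribute reports how many capturing groups a compiled pattern has: one for the first pattern, none for the second.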
To conclude this chapter, I’ll show you the regular expression for a phone number in several applications.
TextMate is an editor that is available only on the Mac and uses the same regular expression library as the Ruby programming language. You can use regular expressions through the Find (search) feature, as shown in Figure 1-3. Check the box next to Regular expression.
Notepad++ is available on Windows and is a popular, free editor that uses the PCRE regular expression library. You can access regular expressions through search and replace (Figure 1-4) by clicking the radio button next to Regular expression.
Oxygen is also a popular and powerful XML editor that uses Perl 5 regular expression syntax. You can access regular expressions through the search and replace dialog, as shown in Figure 1-5, or through its regular expression builder for XML Schema. To use regular expressions with Find/Replace, check the box next to Regular expression.
This is where the introduction ends. Congratulations. You’ve covered a lot of ground in this chapter. In the next chapter, we’ll focus on simple pattern matching.
What a regular expression is
How to use Regexpal, a simple regular expression processor
How to match string literals
How to match digits with a character class
How to match a digit with a character shorthand
How to match a non-digit with a character shorthand
How to use a capturing group and a backreference
How to match an exact quantity of a set of strings
How to match a character optionally (zero or one) or one or more times
How to match strings at either the beginning or the end of a line
Regexpal (http://www.regexpal.com) is a web-based, JavaScript-powered regex implementation. It’s not the most complete implementation, and it doesn’t do everything that regular expressions can do; however, it’s a clean, simple, and very easy-to-use learning tool, and it provides plenty of features for you to get started.
You can download the Chrome browser from https://www.google.com/chrome or Firefox from http://www.mozilla.org/en-US/firefox/new/.
Why are there so many ways of doing things with regular expressions? One reason is because regular expressions have a wonderful quality called composability. A language, whether a formal, programming or schema language, that has the quality of composability (James Clark explains it well at http://www.thaiopensource.com/relaxng/design.html#section:5) is one that lets you take its atomic parts and composition methods and then recombine them easily in different ways. Once you learn the different parts of regular expressions, you will take off in your ability to match strings of any kind.
TextMate is available at http://www.macromates.com. For more information on regular expressions in TextMate, see http://manual.macromates.com/en/regular_expressions.
For more information on Notepad++, see http://notepad-plus-plus.org. For documentation on using regular expressions with Notepad++, see http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions.
Find out more about Oxygen at http://www.oxygenxml.com. For information on using regex through find and replace, see http://www.oxygenxml.com/doc/ug-editor/topics/find-replace-dialog.html. For information on using its regular expression builder for XML Schema, see http://www.oxygenxml.com/doc/ug-editor/topics/XML-schema-regexp-builder.html.