You have already seen groups in action. Groups surround text with parentheses to help perform some operation, such as the following:
Performing alternation, a choice between two or more optional patterns
Creating subpatterns
Capturing a group to later reference with a backreference
Applying an operation to a grouped pattern, such as a quantifer
Using non-capturing groups
Atomic grouping (advanced)
We’ll be using a few contrived examples, in addition to the text from “The Rime of the Ancyent Mariner” again, in rime.txt. This time, I’ll use the desktop version of RegExr, as well as other tools like sed. You can download the desktop version of RegExr from http://www.regexr.com, for Windows, Mac, or Linux (it was written with Adobe AIR). Click the Desktop Version link on the RegExr web page (lower-right corner) for more information.
Simply said, alternation gives you a choice of alternate patterns to match. For example, let’s say you wanted to find out how many occurrences of the article the are in the “The Rime of the Ancient Mariner.” The problem is, the word occurs as THE, The, and the in the poem. You can use alternation to deal with this peculiarity.
Open the RegExr desktop application by double-clicking on its icon. It looks very much like the online version but has the advantage of being local on your machine, so you won’t suffer the network issues that sometimes occur when using web applications. I’ve copied and pasted the entire poem in RegExr desktop for the next exercise. I’m using it on a Mac running OS X Lion.
In the top text box, enter the pattern:
(the|The|THE)
and you’ll see all occurrences of the in the poem highlighted in the lower box (see Figure 4-1). Use the scroll bar to view more of the result.
We can make this group shorter by applying an option. Options let you specify the way you would like to search for a pattern. For example, the option:
(?i)
makes your pattern case-insensitive, so instead of using the original pattern with alternation, you can do this instead:
(?i)the
Try this in RegExr to see how it works. You can also specify case-insensitivity by checking ignoreCase in RegExr, but both will work. This and other options or modifiers are listed in Table 4-1.
Table 4-1. Options in regular expressions
Option | Description | Supported by |
---|---|---|
| Unix lines | Java |
| Case insensitive | |
| Allow duplicate names | PCRE[a] |
| Multiline | PCRE, Perl, Java |
| Single line (dotall) | PCRE, Perl, Java |
| Unicode case | Java |
| Default match lazy | PCRE |
| Ignore whitespace, comments | PCRE, Perl, Java |
| Unset or turn off options | PCRE |
[a] See “Named Subpatterns” in http://www.pcre.org/pcre.txt. |
Let’s now use alternation with grep. The options in Table 4-1, by the way, don’t work with grep, so you are going to use the original alternation pattern. To count the number of lines where the word the occurs, regardless of case, one or more times, use:
grep -Ec "(the|The|THE)" rime.txt
and get this answer:
327
This result does not tell the whole story. Stay tuned.
Here is an analysis of the grep command:
The -E option means that you want to use extended regular expressions (EREs) rather than basic regular expressions (BREs). This, for example, saves you from having to escape the parentheses and the vertical bar, like
\(THE\|The\|the\)
, as you must with BREs.The -c option returns a count of the matched lines (not matched words).
The parentheses group the choice or alternation of the, The, or THE.
The vertical bar separates possible choices, which are evaluated left to right.
To get a count of actual words used, this approach will return each occurrence of the word, one per line:
grep -Eo "(the|The|THE)" rime.txt | wc -l
This returns:
412
And here is a bit more analysis:
The
-o
option means to show only that part of the line that matches the pattern, though this is not apparent due to the pipe (|
) to wc.The vertical bar, in this context, pipes the output of the grep command to the input of the wc command. wc is a word count command, and
-l
counts the number of lines of the input.
Why the big difference between 327 and 412? Because -c gives you a count of matching lines, but there can be more than one match on each line. If you use -o with wc -l, then each occurrence of the various forms of the word will appear on a separate line and be counted, giving the higher number.
To perform this same match with Perl, write your command this way:
perl -ne 'print if /(the|The|THE)/' rime.txt
Or better yet, you can do it with the (?i)
option mentioned earlier, but without
alternation:
perl -ne 'print if /(?i)the/' rime.txt
Or even better yet, append the i modifier after the last pattern delimiter:
perl -ne 'print if /the/i' rime.txt
and you will get the same outcome. The simpler the better. For a list of additional modifiers (also called flags), see Table 4-2. Also, compare options (similar but with a different syntax) in Table 4-1.
Most often, when you refer to subpatterns in regular expressions, you are referring to a group or groups within groups. A subpattern is a pattern within a pattern. Often, a condition in a subpattern is matchable when a preceding pattern is matched, but not always. Subpatterns can be designed in a variety of ways, but we’re concerned primarily with those defined within parentheses here.
In one sense, the pattern you saw earlier:
(the|The|THE)
has three subpatterns: the is the first subpattern, The is the second, and THE the third, but matching the second subpattern, in this instance, is not dependent on matching the first. (The leftmost pattern is matched first.)
Now here is one where the subpattern(s) depend on the previous pattern:
(t|T)h(e|eir)
In plain language, this will match the literal characters t or T followed by an h followed by either an e or the letters eir. Accordingly, this pattern will match any of:
the
The
their
Their
In this case, the second subpattern (e|eir)
is dependent on the first (tT)
.
Subpatterns don’t require parentheses. Here is an example of subpatterns done with character classes:
\b[tT]h[ceinry]*\b
This pattern can match, in addition to the or
The, words such as thee,
thy and thence. The two word
boundaries (\b
) mean the pattern will
match whole words, not letters embedded in other words.
Here is a complete analysis of this pattern:
\b
matches a beginning word boundary.[tT]
is a character class that matches either an lowercase t or an uppercase T. We can consider this the first subpattern.Then the pattern matches (or attempts to match) a lowercase h.
The second or last subpattern is also expressed as a character class
[ceinry]
followed by a quantifier*
for zero or more.Finally, another word boundary
\b
ends the pattern.
Note
One interesting aspect of the state of regular expressions is that terminology, while usually close in meaning, can also range far. In defining subpattern and other terms in this book, I’ve examined a variety of sources and have tried to bring them together under one roof. But I suspect that there are some who would argue that a character class is not a subpattern. My take is they can function as subpatterns, so I lump them in.
When a pattern groups all or part of its content into a pair of parentheses, it captures that content and stores it temporarily in memory. You can reuse that content if you wish by using a backreference, in the form:
\1
or:
$1
where \1
or $1
reference the first captured group, \2
or $2
reference the second captured group, and so on. sed
will only accept the \1
form, but Perl
accepts both.
Note
Originally, sed supported backreferences in
the range \1
through \9
, but that limitation does not appear to
exist any longer.
You have already seen this in action, but I’ll demonstrate it here again. We’ll use it to rearrange the wording of a line of the poem, with apologies to Samuel Taylor Coleridge. In the top text box in RegExr, after clicking the Replace tab, enter this pattern:
(It is) (an ancyent Marinere)
Scroll the subject text (third text area) down until you can see the highlighted line, and then in the second box, enter:
$2 $1
and you’ll see in the lowest box the line rearranged as:
an ancyent Marinere It is,
(See Figure 4-2.)
Here is how to accomplish the same result with sed:
sed -En 's/(It is) (an ancyent Marinere)/\2 \1/p' rime.txt
and the output will be:
an ancyent Marinere It is,
just as in RegExr. Let’s analyze the sed command to help you understand everything that is going on:
The -E option once again invokes EREs, so you don’t have to quote the parentheses, for example.
The -n option suppresses the default behavior of printing every line.
The substitute command searches for a match for the text “It is an ancyent Marinere,” capturing it into two groups.
The substitute command also replaces the match by rearranging the captured text in the output, with the backreference
\2
first, then\1
.The p at the end of the substitute command means you want to print the line.
A similar command in Perl will do the same thing:
perl -ne 'print if s/(It is) (an ancyent Marinere)/\2 \1/' rime.txt
Notice that this uses the \1
style syntax. You can, of course, use the $1
syntax, too:
perl -ne 'print if s/(It is) (an ancyent Marinere)/$2 $1/' rime.txt
I like how Perl lets you print a selected line without jumping through hoops.
I’d like to point out something about the output:
an ancyent Marinere It is,
The capitalization got mixed up in the transformation. Perl can fix
that with \u
and \l
. Here’s how:
perl -ne 'print if s/(It is) (an ancyent Marinere)/\u$2 \l$1/' rime.txt
Now the result looks much better:
An ancyent Marinere it is,
And here is why:
The
\l
syntax does not match anything, but it changes the character that follows to lowercase.The
\u
syntax capitalizes the character that follows it.The
\U
directive (not shown) turns the text string that follows into all uppercase.The
\L
directive (not shown) turns the text string that follows into all lowercase.
These directives remain in effect until another is found (like
\l
or \E
, the end of a quoted string). Experiment with these to see
how they work.
Named groups are captured groups with names. You can access those groups by name later, rather than by integer. I’ll show you how here in Perl:
perl -ne 'print if s/(?<one>It is) (?<two>an ancyent Marinere)/\u$+{two} \l$+{one}/' rime.txt
Let’s look at it:
Adding
?<one>
and?<two>
inside the parentheses names the groups one and two, respectively.$+{one}
references the group named one, and$+{two}
, the group named two.
You can also reuse named groups within the pattern where the group was named. I’ll show you what I mean. Let’s say you were searching for a string that contained six zeros all together:
000000
It’s a shallow example, but serves to show you how this works. So name a group of three zeros with this pattern (the z is arbitrary):
(?<z>0{3})
You can then use the group again like this:
(?<z>0{3})\k<z>
Or this:
(?<z>0{3})\k'z'
Or this:
(?<z>0{3})\g{z}
Try this in RegExr for quick results. All these examples will work. Table 4-3 shows many of the possibilities with named group syntax.
There are also groups that are non-capturing groups—that is, they don’t store their content in memory. Sometimes this is an advantage, especially if you never intend to reference the group. Because it doesn’t store its content, it is possible it may yield better performance, though performance issues are hardly perceptible when running the simple examples in this book.
Remember the first group discussed in this chapter? Here it is again:
(the|The|THE)
You don’t need to backreference anything, so you could write a non-capturing group this way:
(?:the|The|THE)
Going back to the beginning of this chapter, you could add an option to make the pattern case-insensitive, like this (though the option obviates the need for a group):
(?i)(?:the)
Or you could do it this way:
(?:(?i)the)
Or, better yet, the pièce de résistance:
(?i:the)
The option letter i can be inserted between the question mark and the colon.
Another kind of non-capturing group is the atomic group. If you are using a regex engine that does backtracking, this group will turn backtracking off, not for the entire regular expression but just for that part enclosed in the atomic group. The syntax looks like this:
(?>the)
When would you want to use atomic groups? One of the things that can really slow regex processing is backtracking. The reason why is, as it tries all the possibilities, it takes time and computing resources. Sometimes it can gobble up a lot of time. When it gets really bad, it’s called catastrophic backtracking.
You can turn off backtracking altogether by using a non-backtracking engine like re2 (http://code.google.com/p/re2/) or by turning it off for parts of your regular expression with atomic grouping.
Note
My focus in this book is to introduce syntax. I talk very little about performance tuning here. Atomic groups are mainly a performance consideration in my view.
In Chapter 5, you’ll learn about character classes.
That alternation allows a choice between two or more patterns
What options modifiers are and how to use them in a pattern
Different kinds of subpatterns
How to use capturing groups and backreferences
How to use named groups and how to reference them
How to use non-capturing groups.
A little about atomic grouping.
The Adobe AIR runtime lets you use HTML, JavaScript, Flash, and ActionScript to build web applications that run as standalone client applications without having to use a browser. Find out more at http://www.adobe.com/products/air.html.
Python (http://www.python.org) is an easy-to-understand, high-level programming language. It has a regular expression implementation (see http://docs.python.org/library/re.html).
.NET (http://www.microsoft.com/net) is a programming framework for the Windows platform. It, too, has a regular expression implementation (see http://msdn.microsoft.com/en-us/library/hs600312.aspx).
More advanced explanations of atomic grouping are available at http://www.regular-expressions.info/atomic.html and http://stackoverflow.com/questions/6488944/atomic-group-and-non-capturing-group.
Get Introducing Regular Expressions now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.