Search the Catalog
Mastering Regular Expressions

Mastering Regular Expressions

Powerful Techniques for Perl and Other Tools

By Jeffrey E. F. Friedl
1st Edition January 1997
1-56592-257-3, Order Number: 2573
368 pages, $34.95

Chapter 4
The Mechanics of Expression Processing

Now that we have some background under our belt, let's delve into the mechanics of how a regex engine really goes about its work. Here we don't care much about the Shine and Finish of the previous chapter; this chapter is all about the engine and the drive train, the stuff that grease monkeys talk about in bars. We'll spend a fair amount of time under the hood, so expect to get a bit dirty with some practical hands-on experience.

Start Your Engines!

Let's see how much I can milk this engine analogy for. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the Rich Corinthian Leather. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it?

Two Kinds of Engines

Well, what if you had an electric car? They've been around for a long time, but they aren't as common as gas cars because they're hard to design well. If you had one, though, you would have to remember not to put gas in it. If you had a gasoline engine, well, watch out for sparks! An electric engine more or less just runs, but a gas engine might need some babysitting. You can get much better performance just by changing little things like your spark plug gaps, air filter, or brand of gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls.

Each engine might do its work differently, but the end result is that the wheels turn. You still have to steer properly if you want to get anywhere, but that's an entirely different issue.

New Standards

Let's stoke the fire by adding another variable: the California Emissions Standards.1 Some engines adhere to California's strict pollution standards, and some engines don't. These aren't really different kinds of engines, just new variations on what's already around. The standard regulates a result of the engine's work, the emissions, but doesn't say one thing or the other about how the engine should go about achieving those cleaner results. So, our two classes of engine are divided into four types: electric (adhering and non-adhering) and gasoline (adhering and non-adhering).

California has rather strict standards regulating the amount of pollution a car can produce. Because of this, many cars sold in America come in "for California'' and "non-California'' models.

Come to think of it, I bet that an electric engine can qualify for the standard without much change, so it's not really impacted very much -- the standard just "blesses'' the clean results that are already par for the course. The gas engine, on the other hand, needs some major tweaking and a bit of re-tooling before it can qualify. Owners of this kind of engine need to pay particular care to what they feed it -- use the wrong kind of gas and you're in big trouble in more ways than one.

The impact of standards

Better pollution standards are a good thing, but they require that the driver exercise more thought and foresight (well, at least for gas engines, as I noted in the previous paragraph). Frankly, however, the standard doesn't impact most people since all the other states still do their own thing and don't follow California's standard... yet. It's probably just a matter of time.

Okay, so you realize that these four types of engines can be classified into three groups (the two kinds for gas, and electric in general). You know about the differences, and that in the end they all still turn the wheels. What you don't know is what the heck this has to do with regular expressions!

More than you might imagine.

Regex Engine Types

There are two fundamentally different types of regex engines: one called "DFA'' (the electric engine of our story) and one called "NFA'' (the gas engine). The details follow shortly (=>101), but for now just consider them names, like Bill and Ted. Or electric and gas.

Both engine types have been around for a long time, but like its gasoline counterpart, the NFA type seems to be used more often. Tools that use an NFA engine include Tcl, Perl, Python, GNU Emacs, ed, sed, vi, most versions of grep, and even a few versions of egrep and awk. On the other hand, a DFA engine is found in almost all versions of egrep and awk, as well as lex and flex. Table 4-1 lists a few common programs available for a wide variety of platforms and the regex engine that most versions use. A generic version means that it's an old tool with many clones -- I have listed notably different clones that I'm aware of.2

Where I could find them, I used comments in the source code to identify the author (or, for the generic tools, the original author). I relied heavily on Libes and Ressler's Life With Unix (Prentice Hall, 1989) to fill in the gaps.

As Chapter 3 illustrated, 20 years of development with both DFAs and NFAs resulted in a lot of needless variety. Things were dirty. The POSIX standard came in to clean things up by specifying clearly which metacharacters an engine should support, and exactly the results you could expect from them. Superficial details aside, the DFAs (our electric engines) were already well suited to adhere to the standard, but the kind of results an NFA traditionally provided were quite different from the new standard, so changes were needed. As a result, broadly speaking, we have three types of regex engines:


Table 4-1: Some Tools and Their Regex Engines
Program (Original) Author Version Regex Engine
awk Aho, Weinberger, Kernighan generic DFA
new awk Brian Kernighan generic DFA
GNU awk Arnold Robbins recent Mostly DFA, some NFA
MKS awk Mortice Kern Systems POSIX NFA
mawk Mike Brennan all POSIX NFA
egrep Alfred Aho generic DFA
MKS egrep Mortice Kern Systems POSIX NFA
GNU Emacs Richard Stallman all Trad. NFA (POSIX NFA available)
Expect Don Libes all Traditional NFA
expr Dick Haight generic Traditional NFA
grep Ken Thompson generic Traditional NFA
GNU grep Mike Haertel Version 2.0 Mostly DFA, but some NFA
GNU find GNU Traditional NFA
lex Mike Lesk generic DFA
flex Vern Paxson all DFA
lex Mortice Kern Systems POSIX NFA
more Eric Schienbrood generic Traditional NFA
less Mark Nudelman Variable (usually Trad. NFA)
Perl Larry Wall all Traditional NFA
Python Guido van Rossum all Traditional NFA
sed Lee McMahon generic Traditional NFA
Tcl John Ousterhout all Traditional NFA
vi Bill Joy generic Traditional NFA


POSIX standardized the workings of over 70 programs, including traditional regex-wielding tools such as awk, ed, egrep, expr, grep, lex, and sed. Most of these tools' regex flavor had (and still have) the weak powers equivalent to a moped. So weak, in fact, that I don't find them interesting for discussing regular expressions. Although they can certainly be extremely useful tools, you won't find much mention of expr, ed, and sed in this book. Well, to be fair, some modern versions of these tools have been retrofitted with a more-powerful flavor. This is commonly done to grep, a direct regex sibling of sed, ed, and expr.

On the other hand, egrep, awk, and lex were normally implemented with the electric DFA engine, so the new standard primarily just confirmed the status quo -- no big changes. However, there were some gas-powered versions of these programs which had to be changed if they wanted to be POSIX-compliant. The gas engines that passed the California Emission Standards tests (POSIX NFA) were fine in that they produced results according to the standard, but the necessary changes only enhanced their fickleness to proper tuning. Where before you might get by with slightly misaligned spark plugs, you now have absolutely no tolerance. Gasoline that used to be "good enough'' now causes knocks and pings. But so long as you know how to maintain your baby, the engine runs smoothly. And cleanly.

From the Department of Redundancy Department

At this point I'll ask you to go back and review the story about engines. Every sentence there rings with some truth about regular expressions. A second reading should raise some questions. Particularly, what does it mean that an electric DFA more or less "just runs?'' What kind of things affect a gas-powered NFA? How can I tune my NFA? What special concerns does an emissions-controlled POSIX NFA have? What's a "stalled engine'' in the regex world? Last, and certainly least, just what is the regex counterpart to Rich Corinthian Leather?

Match Basics

Before looking at the differences among these engine types, let's first look at their similarities. Certain aspects of the drive train are the same (or for all practical purposes appear to be the same), so these examples can cover all engine types.

About the Examples

This chapter is primarily concerned with a generic, full-function regex engine, so some tools won't support exactly everything presented. In my examples, the dip stick might be to the left of the oil filter, while under your hood it might be behind the distributor cap. Your goal is to understand the concepts so that you can drive and maintain your favorite regex package (and ones you find interest in later).

I'll continue to use Perl's notation for most of the examples, although I'll occasionally show others to remind you that the notation is superficial and that the issues under discussion transcend any one tool or flavor. To cut down on wordiness here, I'll rely on you to check Chapter 3 if I use an unfamiliar construct.

This chapter details the practical effects of how a match is carried out. It would be nice if everything could be distilled down to a few simple rules that could be memorized without needing to understand what is going on. Unfortunately, that's not the case. In fact, with all this chapter offers, I identify only two all-encompassing rules I can list for you:

  1. the earliest match wins

  2. the quantifiers are greedy

We'll look at these rules, their effects, and much more throughout this chapter. Let's start by diving into the details of the first rule.

Rule 1: The Earliest Match Wins

Let's get down to business with The First Rule of Regular Expressions:

The match that begins earliest wins

This rule says that any match that begins earlier in the string is always preferred over any plausible match that begins later. This rule doesn't say anything about how long the winning match might be (we'll get into that shortly), merely that among all the matches possible anywhere in the string, the one that begins the leftmost in the string is chosen. Actually, since more than one plausible match can start at the same earliest point, perhaps the rule should read "a match...'' instead of "the match...,'' but that sounds odd.

Here's how the rule comes about: the match is first attempted at the very beginning (just before the first character) of the string to be searched. "Attempted'' means that every permutation of the entire (perhaps complex) regex is tested starting right at that spot. If all possibilities are exhausted and a match is not found, the complete expression is re-tried starting from just before the second character. This full retry occurs at each position in the string until a match is found. A "no match'' result is reported only if no match is found after the full retry has been attempted at each position all the way to the end of the string (just after the last character).

Thus, when trying to match ORA against FLORAL, the first attempt at the start of the string fails (since ORA can't match FLO). The attempt starting at the second character also fails (it doesn't match LOR either). The attempt starting at the third position, however, does match, so the engine stops and reports the match: FLORAL.

If you didn't know this rule, results might sometimes surprise you. For example, when matching cat against

The dragging belly indicates your cat is too fat
the match is in indicates, not at the word cat that appears later in the line. This word cat could match, but the cat in indicate appears earlier in the string, so it is the one that is matched. For an application like egrep which cares only whether there is a match, not where the match might be, the distinction is irrelevant. For other uses, such as with a search-and-replace, it becomes paramount.

Remember, the regex is tried completely each time, so something such as fat|cat|belly|your matches `belly'

The dragging belly indicates your cat is too fat
rather than fat, even though fat is listed first among the alternatives. Sure, the regex could conceivably match fat and the others, but since they are not the earliest (starting furthest to the left) possible match, they are not the one chosen. As I said, the entire regex is attempted completely from one spot before moving along the string to try again from the next spot, and in this case that means trying each alternative fat, cat, belly, and your at each position before moving on.

The "Transmission'' and the Bump-Along

It might help to think of this rule as the car's transmission, connecting the engine to the drive train while adjusting for the gear you're in (or changing gears for you if it's an automatic -- perhaps the automotive equivalent of some internal optimizations we'll be talking about in the next chapter). The engine itself does the real work (turning the crank); the transmission transfers this work to the wheels.

The transmission's main work: the bump-along

If the engine can't find a match starting at the beginning of the string, it's the transmission that bumps the regex engine along to attempt a match at the next position in the string, and the next, and the next, and so on. Usually. If, for instance, a regex begins with a start-of-string anchor, the transmission can realize that any bump-along would be futile, for only the attempt at the start of the string could possibly be successful. This is an example of the "String/Line Anchors" optimization discussed in the next chapter.

Engine Pieces and Parts

An engine is made up of parts of various types and sizes. You can't possibly hope to understand how the whole thing works if you don't know much about the individual parts. In a regex, these parts are the individual units -- literal characters, quantifiers (star and friends), character classes, parentheses, and so on. The combination of these parts (and the engine's treatment of them) makes a regex what it is, so looking at ways they can be combined and how they interact is our primary interest. First, let's take a look at some of the individual parts:

Literal text
With a literal, non-metacharacter like z or !, the match attempt is simply "Does this literal character match the current text character?'' If your regex is only literal text, such as usa, it is treated as "u and then s and then a.'' It's a bit more complicated if you have the engine do a case-insensitive match, where b matches B and vice-versa, but it's still pretty straightforward.

Character classes, dot, and the like
Matching a character class is not difficult either. Regardless of the length of the character class, it still matches just one character.3 A character class represents a set of characters that can match. Characters are included explicitly, or in a negated class excluded explicitly. Dot is just a shorthand for a large character class that matches any character (any character except newline and/or null in some flavors), so it's not a problem either. The same applies to other shorthand conveniences such as \w, \W, \d, \D, \s, \S, and the like.

Actually, as we saw in the previous chapter (=>81), a POSIX collating sequence can match multiple characters, but this is not common.

Anchors
A few other metacharacters are almost as simple, but they don't actually match characters in the target string; rather, they match a position in the target string. This includes string/line boundaries (caret and dollar), as well as word boundaries \<, \b, and such. The tests are simple because, for the most part, they simply compare two adjacent characters in the target string.

Simple parentheses
Certain parentheses used only for capturing text, as opposed to those used merely for grouping, have some performance impact (discussed in Chapter 5), but for the most part, they don't change how the match is carried out.

No electric parentheses or backreferences

I'd like to first concentrate on the similarities among the engines, but as foreshadowing, I'll show an interesting difference. Capturing parentheses (and the associated backreferences) are like a gas additive -- they affect a gasoline engine, but they are irrelevant to an electric engine because it doesn't use gas in the first place. The way a DFA engine works completely precludes the concept of backreferences and capturing parentheses. It just can't happen.4 This explains why tools developed with DFAs don't provide these features. You'll notice that awk, lex, and egrep don't have backreferences or any $1 type functionality.

This does not, of course, mean that there can't be some mixing of technologies to try to get the best of both worlds. This is discussed in a sidebar on page 121.

You might, however, notice that GNU's version of egrep does support backreferences. It does so by having two complete engines under the hood! It first uses a DFA engine to see whether a match is likely, and then uses an NFA engine (which supports the full flavor, including backreferences) to confirm the match. Later in this chapter, we'll see why a DFA engine can't deal with backreferences or capturing, and why anyone ever bothers with such an engine at all. (It has some major advantages, such as being able to match very quickly.)

Rule 2: Some Metacharacters Are Greedy

So far, we've seen matching that is quite straightforward. It is also rather uninteresting -- you can't do much without involving more-powerful metacharacters such as star, plus, alternation, and so on. Their added power requires more information to understand them fully.

First, you need to know that the quantifiers (?, *, +, and {min,max}) are greedy. When one of these governs a subexpression, such as the a in a?, the (expr) in (expr)*, or the [0-9] in [0-9]+, there is a minimum number of matches that are required before it can be considered successful, and a maximum number that it will ever attempt to match. This has been mentioned in earlier chapters -- what's new here concerns The Second Rule of Regular Expressions:

Items that are allowed to match a variable number of times always attempt to match as much as possible. They are greedy.

In other words, the quantifiers settle for the minimum number of required matches if they have to, but they always attempt to match as many times as they can, up to their maximum allowed.

The only time they settle for anything less than their maximum allowed is when matching too much ends up causing some later part of the regex to fail. A simple example is using \<\w+s\> to match words ending with an `s', such as regexes. The \w+ alone is happy to match the entire word, but if it does, the s can not match. To achieve the overall match, the \w+ settles for matching regexes, thereby allowing s\>, and thus the full regex, to match.

If it turns out that the only way the rest of the regex can succeed is when the greedy construct in question matches nothing (and if zero matches are allowed, as with star, question, and {0,max} intervals), well, that's perfectly fine too. However, it turns out this way only if the requirements of some later subexpression forces the issue. Because the quantifiers always (or, at least, try to) take more than they minimally need, they are called greedy.

This rule has many useful (but sometimes troublesome) implications. It explains, for example, why [0-9]+ matches the full number in March·1998. Once the 1 has been matched, the plus has fulfilled its minimum requirement, but because it tries for its maximum number of matches, it continues and matches the 998 before being forced to stop by the end of the string. (Since [0-9] can't match the nothingness at the end of the string, the plus finally stops.)

A subjective example

Of course, this method of grabbing things is useful for more than just numbers. Let's say you have a line from an email header and want to check whether it is the subject line. As we saw in earlier chapters, you simply use ^Subject:·. However, if you use ^Subject:·(.*), you can later access the text of the subject itself via the tool's after-the-fact parenthesis memory (for example, $1 in Perl).5

This example uses capturing as a forum for presenting greediness, so the example itself is appropriate only for NFAs (because only NFAs support capturing). The lessons on greediness, however, apply to all engines, including the non-capturing DFA.

Before looking at why .* matches the entire subject, be sure to understand that once the ^Subject:· part matches, you're guaranteed that the entire regular expression will eventually match. You know this because there's nothing after ^Subject:· that could cause the expression to fail -- .* can never fail since the worst case of "no matches" is still considered successful for star.

Why do we even bother adding .*? Well, we know that because star is greedy, it attempts to match dot as many times as possible, so we use it to "fill" $1. In fact, the parentheses add nothing to the logic of what the regular expression matches -- in this case we use them simply to capture the text matched by .*.

Once .* hits the end of the string, the dot isn't able to match, so the star finally stops and lets the next item in the regular expression attempt to match (for even though the starred dot could match no further, perhaps a subexpression later in the regex could). Ah, but since it turns out that there is no next item, we reach the end of the regex and we know that we have a successful match.


So, with a variable $line holding a string such as

Subject: Re: happy birthday

the Perl code
if ( $line =~ m/^Subject: (.*)/ ) {
    print "The subject is: $1\n";
}

produces `The subject is: Re: happy birthday'.

To make the example more concrete, here's the snippet in Tcl

if [regexp "^Subject: (.*)" $line all exp1] {
    puts "The subject is: $exp1"
}

and in Python:
reg = regex.compile("Subject: \(.*\)")
if reg.match(line) > 0:
    print "The subject is:", reg.group(1)

As you can see, each language handles regular expressions in its own way, but the concept (and result) of greediness stays the same.

Regarding replies

To extend this example, let's consider bypassing that pesky `Re:·' that most mail systems prepend to a subject to indicate the message is a reply. Our goal is to ignore any `Re:·' that might begin the subject, printing only the "real" subject part.

We can use greediness to our advantage and take care of this right in the regex. Consider ^Subject:·(Re:·)?(.*), with (Re:·)? added before (.*). Both subexpressions are greedy, but (Re:·)? gets to be greedy first, nabbing `Re:·' before .* takes what's left. In fact, we could use (Re:·)* just as well, which scoops up all the Re: that might have been stacked up over the course of back and forth replies.

The parentheses in (Re:·)? are intended only for grouping, but they still count as a set of parentheses. This means that our original parentheses, which grab what .* matches, become the second set. This, in turn, means that the subject of our interest becomes $2, not $1. So, if our code is

if ( $line =~ m/^Subject: (Re: )?(.*)/ ) {
    print "The subject is: $2\n";
}
we get `The subject is: happy birthday'.


You might even imagine something like:

if ( $line =~ m/^Subject: (Re: )?(.*)/ ) {
    # we have a match -- alter our report depending upon $1
    if ($1 eq "Re: ") {
        print "The reply is: $2\n";
    } else {
        print "The subject is: $2\n";
    }
}

Even if the set of parentheses that fills $1 is not able to match, the second set still stuffs its matched text into $2.


As a final comparison, let's look at the same expression with one of the parentheses moved:


Since sets of parentheses are always labeled by counting open parentheses from the left, the Re:· parentheses become $2, and the whole subject becomes $1. However, this time the `Re:·' that might be matched into $2 is also within the first set of parentheses, so $1 will have that same `Re:·' (as well as the rest of the subject). Although this isn't useful with our example so far, it would be if you wanted to access the subject with any `Re:·' intact, but also want a simple way to tell whether it is a reply.

Being too greedy

Let's get back to the concept of a greedy quantifier being as greedy as it can be. Consider how the matching and results would change if we add another .*: ^Subject:·(.*).*. The answer is: nothing would change. The initial .* (inside the parentheses) is so greedy that it matches all the subject text, never leaving anything for the second .* to match. Again, failing to match is not a problem, since the star does not require a match to be successful. Were the second .* in parentheses as well, the resulting $2 would always be empty.

Does this mean that after .*, a regular expression can never have anything that is expected to actually match? No, of course not. It is possible for something later in the regex to force something previously greedy to give something back (that is, relinquish or "unmatch") if that's what is necessary to achieve an overall match.

Let's consider the possibly useful ^.*([0-9][0-9]), which matches the last two digits in the string, wherever they might be, and saves them to $1. Here's how it works: At first, .* matches the entire string. Because the following ([0-9][0-9]) is required, its failure to match, in effect, tells .* "Hey, you took too much! Give me back something so that I can have a chance to match." Greedy components first try to take as much as they can, but they always defer to the greater need to achieve an overall match. They're just stubborn about it, and only do so when forced. Of course, they'll never give up something that hadn't been optional in the first place, such as a plus quantifier's first match.

With this in mind, let's apply ^.*([0-9][0-9]) to `about·24·characters·long'. Once .* matches the whole string, the requirement for the first [0-9] to match forces .* to give up `g' (the last thing it had matched). That doesn't, however, allow [0-9] to match, so .* is again forced to relinquish something, this time the n in long. This cycle continues 15 more times until .* finally gets around to giving up 4.

Unfortunately, even though the first [0-9] can then match, the second still cannot. So, .* is forced to relinquish more in search of the overall match. This time .* gives up the 2, which the first [0-9] can match. Now, the 4 is free for the second [0-9] to match, and the entire expression matches `about·24·char...', with $1 getting `24'.

First come, first served

Consider changing that regex to ^.*([0-9]+), ostensibly to match not just the last two digits, but the last whole number however long it might be. It won't work. As before, .* is forced to relinquish some of what it had matched because the subsequent [0-9]+ requires a match to be successful. With the `about·24·char...' example, that means unmatching until the 4. Like before, [0-9] can then match. Unlike before, there's then nothing further that must match, so .* is not forced to give up the 2 or any other digits it might have matched. Were .* to do so, the [0-9]+ would certainly be a grateful recipient, but nope, first come first served. Greedy constructs are very possessive -- once something is in their clutches, they'll give it up if forced, but never just to be nice.

If this feels counter-intuitive, realize that [0-9]+ is just one match away from [0-9]*, which is in the same league as .*. Substituting into ^.*([0-9]+), we get ^.*(.*) as our regex, which looks suspiciously like the ^Subject:·(.*).* from a page or so ago.

Getting down to the details

I should clear up a few things here. Phrases like "the .* gives up...'' and "the [0-9] forces...'' are slightly misleading. I used these terms because they're easy to grasp, and the end result appears to be the same as reality. However, what really happens behind the scenes depends on the basic engine type, DFA or NFA. So it's time to see what these really are.

Regex-Directed vs. Text-Directed

The two basic engine types reflect a fundamental difference in how one might apply a regular expression in checking a string. I call the gasoline-driven NFA engine "regex-directed,'' and the electric-driven DFA "text-directed.''

NFA Engine: Regex-Directed

Let's consider one way an engine might match to(nite|knight|night) against the text `...tonight...'. Starting with the t, the regular expression is examined one component at a time, and the "current text'' is checked to see whether it matches. If it does, the next component is checked, and so on, until all components have matched, indicating that an overall match has been achieved.

With the to(nite|knight|night) example, the first component is t, which repeatedly fails until a t is reached. Once that happens, the o is checked against the next character, and if it matches, control moves to the next component. In this case, the "next component'' is (nite|knight|night) which really means "nite or knight or night.'' Faced with three possibilities, the engine just tries each in turn. We (humans with advanced neural nets between our ears) can see that if we're matching tonight, the third alternative is the one that leads to a match. Despite their brainy origins (=>60), a regex-directed engine can't come to that conclusion until actually going through the motions to check.

Attempting the first alternative, nite, involves the same component-at-a-time treatment as before: "Try to match n, then i, then t, and finally e.'' If this fails, as it eventually does, the engine tries another alternative, and so on until it achieves a match or must report failure. Control moves within the regex from component to component, so I call it "regex-directed.''

The control benefits of an NFA engine

In essence, each subexpression of a regex in a regex-directed match is checked independently of the others -- their only interrelation is that of proximity to one another by virtue of being subexpressions thrown together to make a single regex. The layout of the components is what controls an engine's overall movement through a match.

Since the regex directs the NFA engine, the driver (the writer of the regular expression) has considerable opportunity to craft just what he or she wants to happen. (Chapter 5 is devoted to this very subject.) What this really means may seem vague now, but it will all be spelled out just after the mysteries of life are revealed

DFA Engine: Text-Directed

Contrast the regex-directed NFA engine with an engine that, while scanning the string, keeps track of all matches "currently in the works.'' In the tonight example, the engine knows a possible match has started the moment it hits t:

in string in regex
after ...t|onight... possible matches: t|o(nite|knight|night)


Each subsequent character scanned updates the list of possible matches. After a few more characters are matched, the situation becomes

in string in regex
after ...toni|ght... possible matches: to(ni|te|knight|ni|ght)


with two possible matches in the works (and one alternative, knight, ruled out). Yet, with the g that follows, only the third alternative remains viable. Once the h and t are scanned as well, the engine realizes it has a complete match and can return success.

I call this "text-directed'' matching because each character scanned controls the engine. As in the example above, a partial match might be the start of any number of different, yet possible, matches. Matches that are no longer viable are pruned as subsequent characters are scanned. There are even situations where a "partial match in progress'' is also a full match. With to(...)?, for example, if any match in the works ends inside the parentheses, a full match (of `to') is already confirmed and in reserve in case the longer matches don't pan out.

If you reach a character in the text that invalidates all the matches in the works, you must either: 1) revert to one of the full matches in reserve, or, failing that, 2) declare that there are no matches at the attempt's starting point.

Foreshadowing

If you compare these two engines based only on what I've mentioned so far, you might conclude that the text-directed DFA engine is generally faster. The regex-directed NFA engine might waste time attempting to match different subexpressions against the same text (such as the three alternatives in the example).

You would be right. During the course of an NFA match, the same character of the target might be checked by many different parts of the regex (or even by the same part, over and over). Even if a subexpression can match, it might have to be applied again (and again and again) as it works in concert with the rest of the regex to find a match. A local subexpression can fail or match, but you just never know about the overall match until you eventually work your way to the end of the regex. (You know, if I could find a way to include "It's not over until the fat lady sings.'' in this paragraph, I would.) On the other hand, a DFA engine is determinate -- each character in the target is checked once (at most). When a character matches, you don't know yet if it will be part of the final match (it could be part of a possible match that doesn't pan out), but since the engine keeps track of all possible matches in parallel, it need be checked only once, period.

The Mysteries of Life Revealed

The foreshadowing in the last section might have been a bit thick, so I'd better come clean now, at least about some of it. The two basic technologies behind regular-expression engines have the somewhat imposing names Nondeterministic Finite Automaton (NFA) and Deterministic Finite Automaton (DFA). With mouthfuls like this, you see why I stick to just "NFA'' and "DFA.'' We won't be seeing these phrases spelled out again.6

I suppose I could explain the underlying theory that goes into these names... if I only knew it! As I hinted, the word determinate is pretty important, but for the most part the theory is not so long as we understand the practical effects. By the end of this chapter, we will. However, do see the sidebar on page 104.

Because of the regex-directed nature of an NFA, the details of how the engine attempts a match are very important. As I said before, the writer can exercise a fair amount of control simply by changing how the regex is written. With the tonight example, perhaps less work would have been wasted had the regex been written differently, such as to(ni(ght|te)|knight), tonite|toknight|tonight, or perhaps to(k?night|nite). With any given text, they all end up matching exactly the same text, but in doing so direct the engine in different ways. At this point, we don't know enough to judge which regexes, if any, are better than others, but that's coming soon.

It's the exact opposite with a DFA -- since the engine keeps track of all matches simultaneously, none of these differences in representation matter so long as in the end they all represent the same possible matches. There could be a hundred different ways to achieve the same result, but since the DFA keeps track of them all simultaneously (almost magically -- more on this later), it doesn't matter which form the regex takes. To a pure DFA, even expressions that appear as different as abc and [aa-a](b|b{1}|b)c are utterly indistinguishable.

It all boils down to...

Three things come to my mind when describing a DFA engine:
 · DFA matching is very fast
 · DFA matching is very consistent
 · Talking about DFA matching is very boring

(I'll eventually expand on all these points.)

The regex-directed nature of an NFA makes it interesting to talk about. NFAs provide plenty of room for creative juices to flow. There are great benefits in crafting an expression well, and even greater penalties for doing it poorly. A gasoline engine is not the only engine that can stall and conk out completely. To get to the bottom of this, we need to look at the essence of an NFA engine: backtracking.

Backtracking

The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be. If the attempted option is successful and the rest of the regex is also successful, you are finished with the match. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the option and can continue with the match by trying another. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).

A Really Crummy Analogy

Backtracking is like leaving a pile of bread crumbs at every fork in the road. If the path you choose turns out to be a dead end, you can retrace your steps, giving up ground until you come across a pile of crumbs that indicates an untried path. Should that path, too, turn out to be a dead end, you can continue to backtrack, retracing your steps to the next pile of crumbs, and so on, until you eventually find a path that leads to your goal or until you run out of untried paths.

There are various situations when the regex engine needs to choose between two (or more) options -- the alternation we saw earlier is only one example. Another example is that upon reaching ...x?..., the engine must decide whether it should attempt x or not. Upon reaching ...x+..., however, there is no question about trying to match x at least once -- the plus requires at least one match, and that's non-negotiable. Once the first x has been matched, though, the requirement is lifted and it then must decide to match another x or not. If it decides to match, it must decide if it will then attempt to match another... and another... and so on. At each of these many decision points, a virtual "pile of crumbs'' is left behind as a reminder that another option (to match or not to match, whichever wasn't chosen at each point) remains viable at that point.

A crummy little example

Let's look at a full example using our earlier to(nite|knight|night) regex on the string `hot·tonic·tonight!' (silly, yes, but a good example). The first component t is attempted at the start of the string. It fails to match h, so the entire regex fails at that point. The engine's transmission then bumps along to retry the regex from the second position (which also fails), and again at the third. This time the t matches, but the subsequent o fails in matching the `·', so again the whole attempt fails.

The attempt that eventually starts at ...|tonic... is more interesting. Once the to has been matched, the three alternatives become three available options. The regex engine picks one to try, remembering the others ("leaving some bread crumbs'') in case the first fails. For the purposes of discussion, let's say that the engine first chooses nite. That expression breaks down to "n + i + t ...,'' which gets to ...toni|c... before failing. Unlike the earlier failures, this failure doesn't mean the end of the overall attempt because other options still remain. The engine chooses one, we'll say knight, but it fails right away. That leaves only one final option, night, but it eventually fails. Since that was the final untried option, its failure means the failure of the entire attempt starting at ...|tonic..., so the transmission kicks in again.

Once the engine gets to the attempt starting at ...|tonight! it gets interesting again, but this time, the night alternative successfully matches to the end. The successful matching to the end of the regex means an overall match, so the engine can report success at that point.

Two Important Points on Backtracking

The general idea of how backtracking works is fairly simple, but some of the details are quite important for real-world use. Specifically, when faced with multiple choices, which choice should be tried first? Secondly, when forced to backtrack, which saved choice should the engine use?

In situations where the decision is between "make an attempt'' and "skip an attempt,'' as with items governed by a question, star, and the like, the engine always chooses to first make the attempt. It will return later (to try skipping the item) only if forced by the overall need to reach a global expression-wide match.

This simple rule has far-reaching repercussions. For starters, it helps explain regex greediness, but not completely. To complete the picture, we need to know which (among possibly many) saved options to use when we backtrack. Simply put:

The most recently saved option is the one returned to when a local failure forces backtracking. It's LIFO (last in first out).

This is easily understood in the crummy analogy -- if your path becomes blocked, you simply retrace your steps until you come across a pile of bread crumbs. The first you'll return to is the most recently laid. The traditional analogy for describing LIFO also holds: like stacking and unstacking dishes, the most-recently stacked will be the first you'll unstack.

NFA: Theory vs. Reality

The true mathematical and computational meaning of "NFA'' is different from what is commonly called an "NFA regex engine.'' In theory, NFA and DFA engines should match exactly the same text and have exactly the same features. In practice, the desire for richer, more expressive regular expressions has caused their semantics to diverge. We'll see several examples later in this chapter, but one right off the top is support for backreferences.

As a programmer, if you have a true (mathematically speaking) NFA regex engine, it is a relatively small task to add support for backreferences. A DFA's engine's design precludes the adding of this support, but an NFA's common implementation makes it trivial. In doing so, you create a more powerful tool, but you also make it decidedly nonregular (mathematically speaking). What does this mean? At most, that you should probably stop calling it an NFA, and start using the phrase "nonregular expressions,'' since that describes (mathematically speaking) the new situation. No one has actually done this, so the name "NFA'' has lingered, even though the implementation is no longer (mathematically speaking) an NFA.

What does all this mean to you, as a user? Absolutely nothing. As a user, you don't care if it's regular, nonregular, unregular, irregular, or incontinent. So long as you know what you can expect from it (something this chapter will show you), you know all you need to care about.

For those wishing to learn more about the theory of regular expressions, the classic computer-science text is chapter 3 of Aho, Sethi, and Ullman's Compilers -- Principles, Techniques, and Tools (Addison-Wesley, 1986), commonly called "The Dragon Book'' due to the cover design. More specifically, this is the "red dragon''. The "green dragon'' is its predecessor, Aho and Ullman's Principles of Compiler Design.

Saved States

In NFA regular expression nomenclature, the piles of bread crumbs are known as saved states. A state indicates where a test can restart from if need be. It reflects both the position in the regex and the point in the string where an untried option begins. Because this is the basis for NFA matching, let me go over what I've already said with some simple but verbose examples. If you're comfortable with the discussion so far, feel free to skip ahead.

A match without backtracking

Let's look a simple example, matching ab?c against abc. Once the a has matched, the current state of the match is reflected by:

at `a|bc' matching a|b?c


However, now that b? is up to match, the regex engine has a decision to make: attempt the b or not. Well, since ? is greedy, it attempts the match. But so that it can recover if that attempt fails or eventually leads to failure, it adds

at `a|bc' matching ab?|c


to its otherwise empty list of saved states. This indicates that the engine can later pick up the match in the regex just after the b?, picking up in the text from just before the b (that is, where it is now). Thus, in effect, skipping the b as the question mark allows.

Once the engine carefully places that pile of crumbs, it goes ahead and checks the b. With the example text, it matches, so the new current state becomes:

at `ab|c' matching ab?|c


The final c matches as well, so we have an overall match. The one saved state is no longer needed, so it is simply forgotten.

A match after backtracking

Now, if `ac' had been the text to match, everything would have been the same until the b attempt was made. Of course, this time it wouldn't match. This means that the path that resulted from actually attempting the ...? failed. Since there is a saved state available to return to, this "local failure'' does not mean overall failure. The engine backtracks, meaning that it takes the most recently saved state as its new current state. In this case, that would be the

at `a|c' matching ab?|c


that had been saved as the untried option before the b had been attempted. This time, the c and c match up, so the overall match is achieved.

A non-match

Now let's look at the same expression, but against abX. Before the b is attempted, the question mark causes the state

at `a|bX' matching ab?|c


to be saved. The b matches, but that avenue later turns out to be a dead end because the c fails to match X. The failure results in a backtrack to the saved state. The engine next tests c against the b that the backtrack effectively "unmatched.'' Obviously, this test fails, too. If there were other saved states, another backtrack would occur, but since there aren't any, the overall match at the current starting position is deemed a failure.

Are we done? No. The engine's transmission will still do its "bump along the string and retry the regex,'' which might be thought of as a pseudo-backtrack. The match restarts at:

at `a|bX' matching |ab?c


The whole match is attempted again from the new spot, but like before, all paths lead to failure. After the next two attempts (from ab|X and abX|) similarly fail, a true overall failure is finally reported.

Backtracking and Greediness

For tools that use this NFA regex-directed backtracking engine, understanding how backtracking works with your regular expression is the key to writing expressions that accomplish what you want, and accomplish it quickly. We've seen how ? greediness works, so let's look at star (and plus) greediness.

Star, plus, and their backtracking

If you consider x* to be more or less the same as x?x?x?x?x?x?... (or, more appropriately, (x(x(x(x...?)?)?)?)?),7 it's not too different from what we have already seen. Before checking the item quantified by the star, the engine saves a state indicating that if the check fails (or leads to failure), the match can pick up after the star. This is done over and over, until the star match actually does fail.

Just for comparison, remember that a DFA doesn't care much about the form you use to express which matches are possible; the three examples are identical to a DFA.

Thus, when matching [0-9]+ against `a·1234·num', once [0-9] fails trying to match the space after the 4, there are four saved states indicating that the match can pick up in the regex at [0-9]+| at each of the string positions:

a 1|234 num
a 12|34 num
a 123|4 num
a 1234| num
These represent the fact that the attempt of [0-9] had been optional at each of these positions. When it fails to match the space, the engine backtracks to the most recently saved state (the last one listed), picking up at `a·1234|·num' in the text and at [0-9]+| in the regex. Well, that's at the end of the regex. Now that we're actually there and notice it, we realize that we have an overall match.

Note that in the above list of four string positions, `|1234·num' is not a member because the first match using the plus quantifier is required, not optional. Would it have been in the list had the regex been [0-9]*? (hint: it's a trick question) ¤ Turn the page to check your answer.

Where does [0-9]* match?

¤ Answer to the question on page 106.

No, `|1234·num' would not be part of a saved state during a match of [0-9]*. I posed this question because it's a mistake that new users commonly make.

Remember, a component that has star applied can always match. If that's the entire regex, the entire regex can always match. This certainly includes the attempt when the transmission applies the engine the first time, at the start of the string. In this case, the regex matches at `|a·1234·num' and that's the end of it -- it never even gets as far the digits.

Revisiting a fuller example

With our more detailed understanding, let's revisit the ^.*([0-9][0-9]) example from page 97. This time, instead of just pointing to "greediness'' to explain why the match turns out as it does, we can use our knowledge of the mechanics of a match to explain why in precise terms. (If you're feeling a bit snowed in with the details, feel free to skim ahead for the general feel of what this chapter offers -- you can review the examples in more detail later on.)

I'll use `CA·95472,·USA' as an example. Once the .* has successfully matched to the end of the string, there are a dozen saved states accumulated from the star-governed dot matching 12 things that are (if need be) optional. These states note that the match can pick up in the regex at ^.*|([0-9][0-9]), and in the string at each point where a state was created.

Now that we've reached the end of the string and pass control to the first [0-9], the match obviously fails. No problem: we have a saved state to try (a baker's dozen of them, actually). We backtrack, resetting the current state to the one most recently saved, to just before where .* matched the final A. Skipping that match (or "unmatching'' it, if you like) leaves us trying that A against the first [0-9]. It fails.

This backtrack-and-test cycle continues until the engine effectively unmatches the 2, at which point the first [0-9] can match. The second, however, can't, so we must continue to backtrack. It's now irrelevant that the first [0-9] matched during the previous attempt -- the backtrack resets the current state to before the first [0-9]. As it turns out, the same backtrack resets the string position to just before the 7, so the first [0-9] can match again. This time, so can the second (matching the 2). Thus, we have a match: `CA·95472,·USA', with $1 getting 72.

A few observations: first, the backtracking also entailed maintaining the status of the text being matched by the subexpression within parentheses. The backtracks always caused the match to be picked up at ^.*|([0-9][0-9]). As far as the simple match attempt is concerned, this is the same as ^.*|[0-9][0-9], so I used phrases such as "picks up before the first [0-9].'' However, moving in and out of the parentheses involves updating the status of what $1 should be, and this impacts efficiency. This is discussed fully in the next chapter (=> 150).

It is important to realize that something governed by star (or any of the quantifiers) first matches as much as it can without regard to what might follow in the regex. In our example, the .* does not magically know to stop at the first digit, or the second to the last digit, or any other place until the dot finally fails. We saw this earlier when looking at how ^.*([0-9]+) would never match more than a single digit by the [0-9]+ part (=> 98).

More About Greediness

Many concerns (and benefits) of greediness are shared by both an NFA and a DFA. I'd like to look at some ramifications of greediness for both, but with examples explained in terms of an NFA. The lessons apply to a DFA just as well, but not for the same reasons. A DFA is greedy, period, and there's not much more to say after that. It's very constant to use, but pretty boring to talk about. An NFA, however, is interesting because of the creative outlet its regex-directed nature provides. An NFA engine affords the regex author direct control over how a match is carried out. This provides many benefits, as well as some efficiency-related pitfalls. (Discussions of efficiency are taken up in the next chapter.)

Despite these differences, the match results are often similar. For the next few pages, I'll talk of both engine types, but offer the description in the more easily understandable regex-directed terms of an NFA. By the end of this chapter, we'll have a firm grasp of just when the results might differ, as well as exactly why.

Problems of Greediness

As we saw with the last example, .* always marches to the end of the line.8 This is because .* just thinks of itself and grabs what it can, only later giving up something if it is required to achieve an overall match.

Or, with a tool where a dot can also match a newline, and strings that contain multi-line data, it matches through all the logical lines to the end of the string.

Sometimes this can be a real pain. Consider a regex to match text wrapped in doublequotes. At first, you might want to write ".*", but knowing what we know about .*, guess where it will match in:
  The name "McDonald's" is said "makudonarudo" in Japanese

Actually, since we understand the mechanics of matching, we don't need to guess -- we can know. Once the initial quote matches, .* is free to match, and immediately does so all the way to the end of the string. It will back off (or, perhaps more appropriately, be backed off by the regex engine) only as much as is needed until the final quote can match. Running this through in your head, you realize that it will match

The name "McDonald's" is said "makudonarudo" in Japanese
which is obviously not the doublequoted string that was intended. This is one reason why I caution against the overuse of .*, as it can often lead to surprising results if you don't pay careful attention to greediness.

So, how can we have it match only "McDonald's"? The key is to realize that we don't want "anything'' between the quotes, but rather "anything except a quote.'' If we use [^"]* rather than .*, it won't overshoot the closing quote.

The regex engine's basic approach with "[^"]*" is exactly the same as before. Once the initial quote matches, [^"]* gets a shot at matching as much as it can. In this case, that's up to the quote after McDonald's, at which point it finally stops because [^"] can't match the quote. So, at that point, control moves to the closing quote. It happily matches, resulting in overall success:

The name "McDonald's" is said "makudonarudo" in Japanese

Multi-Character "Quotes''

In the first chapter, I talked a bit about matching HTML tags, such as with the sequence ...<B>very</B>... causing the "very'' to be rendered in bold if the browser can do so. Attempting to match a <B>...</B> sequence seems similar to matching a quoted string, except the "quotes'' in this case are the multi-character sequences <B> and </B>. Like the quoted string example, multiple sets of "quotes'' cause problems:
...<B>Billions</B> and <B>Zillions</B> of suns...
If we use <B>.*</B>, the greedy .* causes the match-in-progress to zip to the end of the line, backtracking only far enough to allow the </B> to match, matching the last </B> on the line instead of the one corresponding to the opening <B> at the start of the match.

Unfortunately, since the closing delimiter is more than one character, we can't solve the problem in the same way as we did with doublequoted strings. We can't expect something as ridiculous as <B>[^</B>]*</B> to work. A character class represents only one character and not the full </B> sequence that we want.9

Don't let the apparent structure of [^</B>] fool you. It is just a class to match one character, any character except <, >, /, and B. It is the same as, say [^/<>B], and certainly won't work as an "anything not </B>'' construct.

Laziness?

This whole problem arises because star and friends (the quantifiers) are greedy. For a moment, let's imagine they are "lazy'' (or "lax'' or "minimal-matching'' or "non-greedy'' or "ungreedy'' or whatever you'd like to call it). With a lazy <B>.*</B> and the
...<B>Billions</B> and <B>Zillions</B> of suns...
example, after the initial <B> has matched, a lazy .* would immediately decide that since it didn't require any matches, it would lazily not bother trying to perform any. So, it would immediately pass control to the following <.

The < wouldn't match at that point, so control would return back to the lazy .* where it still had its untried option to attempt a match (to attempt multiple matches, actually). It would begrudgingly do so, with the dot then matching ...<B>Billions.... Again, the star has the option to match more, or to stop. We're assuming it's lazy for the example, so it first tries stopping. The subsequent < still fails, so .* has to again exercise its untried match option. After eight cycles, .* will have eventually matched Billions, at which point the subsequent < (and the whole </B> subexpression) will finally be able to match:

...<B>Billions</B> and <B>Zillions</B> of suns...

So, as we've seen, the greediness of star and friends can be a real boon at times, while troublesome at others. Having non-greedy, lazy versions is wonderful, as they allow you to do things that are otherwise very difficult (or even impossible). As it turns out, Perl provides ungreedy quantifiers in addition to the normal greedy versions. Like most great inventions, the idea is simple; we just had to wait for someone to think of it (in this case, Perl's author, Larry Wall).

Unfortunately, if you are not using Perl and don't have a lazy star quantifier, you are still faced with how to solve the <B>...</B> multi-character quote problem. Frankly, it is quite difficult to solve using a single, straight regular expression -- I recommend splitting the work into two parts, one to find the opening delimiter, the other to search from that point to find the closing delimiter.

Greediness Always Favors a Match.

Recall the price-fixing (so to speak) example from Chapter 2 (=> 46). Due to floating-point representation problems, values that should have been "1.625'' or 3.00'' were sometimes coming out like "1.62500000002828'' and "3.00000000028822''. To fix this, I used
$price =~ s/(\.\d\d[1-9]?)\d*/$1/
to lop off all but the first two or three decimal digits from the value stored in the variable $price. The \.\d\d matches the first two decimal digits regardless, while the [1-9]? matches the third digit only if it is non-zero.

I then noted:


Anything matched so far is what we want to keep, so we wrap it in parentheses to capture to $1. We can then use $1 in the replacement string. If this is the only thing that matches, we replace exactly what was matched with itself -- not very useful. However, we go on to match other items outside the $1 parentheses. They don't find their way to the replacement string, so the effect is that they're removed. In this case, the "to be removed'' text is any extra digits, the \d* at the end of the regex.

So far so good, but let's consider what happens when the contents of the variable $price is already well-formed. When it is 27.625, the (\.\d\d[1-9]?) part matches the entire decimal part. Since the trailing \d* doesn't match anything, the substitution replaces the `.625' with `.625' -- an effective no-op.

This is the desired result, but wouldn't it be just a bit more efficient to do the replacement only when it would have some real effect? In such a case, the \d* would have to actually match at least one digit, since only it matches text to be omitted.

Well, we know how to write "at least one digit''! Simply replace \d* with \d+:

$price =~ s/(\.\d\d[1-9]?)\d+/$1/

With crazy numbers like "1.62500000002828'', it still works as before, but with something such as "9.43'', the trailing \d+ isn't able to match, so rightly, no substitution occurs. So, this is a great modification, yes? No! What happens with a three-digit decimal value like 27.625?

Stop for a moment to work through the match of 27.625 yourself.

In hindsight, the problem is really fairly simple. Picking up in the action once (\.\d\d[1-9]?)\d+ has matched 27.625, we find that \d+ can't match. That's no problem for the regex engine, since the match of `5' by [1-9] was optional and there is still a saved state to try. This is the option of having [1-9]? match nothing, leaving the 5 to fulfill the must-match-one requirement of \d+. Thus, we get the match, but not the right match: .625 is replaced by .62, and the value becomes incorrect.

The lesson here is that a match always takes precedence over a non-match, and this includes taking from what had been greedy if that's what is required to achieve a match.10

10  A feature I think would be useful, but that no regex flavor that I know of has, is what I would call possessive quantifiers. They would act like normal quantifiers except that once they made a decision that met with local success, they would never backtrack to try the other option. The text they match could be unmatched if their enclosing subexpression was unmatched, but they would never give up matched text of their own volition, even in deference to an overall match. A possessive question mark would have solved the problem in (\.\d\d[1-9]?)\d+.

Is Alternation Greedy?

The only major control I haven't yet discussed in depth is |, alternation. How alternation works is an important point because it can work in fundamentally different ways with different regex engines. When alternation is reached, any number of the alternatives might be able to match in the string, but which will? The Second Rule of Regular Expressions refers to quantifiers (star and friends) but not alternation. Is alternation greedy?

Let's look at an NFA engine. When faced with alternation, it tries each alternative in turn. You can pretty much count on each alternative being checked in the order given in the expression. Let's say the regex is ^(Subject|Date):·. When the alternation is reached, the first alternative, Subject, is attempted. If it matches, the rest of the regex, , is given a chance. If it turns out that it can't match, and if other alternatives remain (in this case, Date), the regex engine will backtrack to try them. This is just another case of the regex engine backtracking to a point where untried options are still available. This continues until an overall match is achieved, or until all options (in this case, alternatives) are exhausted.

What text will actually be matched by tour|to|tournament when applied to the string `three·tournaments·won'? All the alternatives are attempted (and fail) during each attempt (at the 1st character position, 2nd, 3rd, and so on) until the transmission starts the attempt at `three·|tournaments·won'. This time, the first alternative, tour, matches. Since the alternation is the last thing in the regex, the moment the tour matches, the whole regex is done. The other alternatives are not even tried again.

So, we see that alternation is not greedy, at least not for an NFA. Well, to be specific, alternation is not greedy for a Traditional NFA. Greedy alternation would have matched the longest possible alternative (tournament), wherever in the list it happened to be. A POSIX NFA, or any DFA would have indeed done just that, but I'm getting a bit ahead of myself.

To make sure you are on your toes, let me ask: which kind of alternation would result in tour|to|tournament matching the same text as to(ur(nament)?)?? Before answering, make sure you realize that both are logically the same: they can match tour, to, and tournament, but nothing else. The question here is, in practice, which text will to(ur(nament)?)? actually match: tour (as with non-greedy alternation), tournament (as with greedy alternation), or something else altogether? ¤ Turn the page to check your answer.

Is to(ur(nament)?)? Greedy?

¤ Answer to the question on page 112.

Once the initial to has matched, we know that an overall match is guaranteed since nothing in the regex that follows is required. Although this may well end up being the final overall match, the engine can't decide that yet, as there is still the possibility that more can be matched -- the question marks are greedy, so they attempt to match what they quantify.

The (ur(nament)?) quantified by the outermost question mark matches if possible, and within that, the nament also matches if possible. So overall, the entire to + ur + nament matches if at all possible. In practice, it matches the same text as a greedy-alternation tour|to|tournament: the longest possible.

Uses for Non-Greedy Alternation

Let's revisit the (\.\d\d[1-9]?)\d* example from page 110. If we realize that \.\d\d[1-9]?, in effect, says "allow either \.\d\d or \.\d\d[1-9]'', we can rewrite the entire expression as (\.\d\d|\.\d\d[1-9])\d*. (There is not a compelling reason to make this change -- it's merely a handy example.) Is it really the same as (\.\d\d[1-9]?)\d*? If alternation is greedy, then yes, it is, but the two are quite different with non-greedy alternation.

Let's consider it as non-greedy for the moment. If the first alternative, \.\d\d, is able to match, the \d* that follows the alternation will certainly succeed. This might include matching a non-zero digit (which, if you'll recall the original problem, is a digit we really want to match only within the parentheses). Also, realize that the second alternative begins with a copy of the entire first alternative -- if the first alternative fails, the second will certainly fail as well. The regex engine will nevertheless make the attempt, but I'll leave that issue of efficiency to the next chapter.

Interestingly, if we use (\.\d\d[1-9]|\.\d\d)\d*, which is the same except that the alternatives have been swapped, we do effectively get a replica of the original (\.\d\d[1-9]?)\d*. The alternation has meaning in this case because if the first alternative fails due to the [1-9], the second alternative still stands a chance.

In distributing the [1-9]? to two alternatives and placing the shorter one first, we fashioned a non-greedy ? of sorts. It ends up being meaningless in this particular example because there is nothing that could ever allow the second alternative to match if the first fails. I see this kind of faux-alternation often, and it is invariably a mistake. In one book I've read, a*((ab)*|b*) is used as an example in explaining something about regex parentheses. A rather silly example, isn't it? Since the first alternative, (ab)*, can never fail, any other alternatives (just b* in this case) are utterly meaningless. You could add

  a*((ab)*|b*|.*|partridge·in·a·pear·tree|[a-z])

and it wouldn't change the meaning a bit.

Non-greedy alternation pitfalls

You can often use non-greedy alternation to your advantage by crafting just the match you want, but non-greedy alternation can also lead to unexpected pitfalls for the unaware. Consider wanting to match a January date of the form `Jan 31'. We need something more sophisticated than, say, Jan·[0123][0-9], as that allows "dates'' such as `Jan 00', `Jan 39', and disallows, say, `Jan 7'.

One simple way to match the date part is to attack it in sections. To match from the first through the ninth, using 0?[1-9] allows a leading zero. Adding [12][0-9] allows for the tenth through the 29th, and 3[01] rounds it out. Putting it all together, we get Jan·(0?[1-9]|[12][0-9]|3[01]).

Where do you think this matches in `Jan 31 is my dad's birthday'? We want it to match `Jan 31', of course, but non-greedy alternation actually matches only `Jan 3'. Surprised? During the match of the first alternative, 0?[1-9], the leading 0? fails, but the alternative matches because the subsequent [1-9] has no trouble matching the 3. Since that's the end of the expression, the match is complete.

Were the order of the alternatives reversed, or the alternation greedy, this problem would not have surfaced. Another approach to our date-matching task could be Jan·(31|[123]0|[012]?[1-9]). Like the first solution, this requires careful arrangement of the alternatives to avoid the problem. Yet a third approach is Jan·(0[1-9]|[12][0-9]?|3[01]?|[4-9]), which works properly regardless of the ordering. Comparing and contrasting these three expressions can prove quite interesting (an exercise I'll leave for your free time, although the "A Few Ways to Slice and Dice a Date'' sidebar on page 115 should be helpful).

A Few Ways to Slice and Dice a Date

A few approaches to the date-matching problem posed on page 114. The calendar associated with each regex has what can be matched by each alternative color-coded with the regex.


Greedy Alternation in Perspective

As we've seen, non-greedy alternation is more powerful than greedy alternation because you have more control over just how a match is attempted -- it doesn't say "Take any of these'' so much as "Try this, then that, and finally the other.''

With an NFA, alternation can entail a lot of backtracking, and finding ways to reduce it is usually a good way to make the regex more efficient, which means faster execution. We'll see some examples soon, and more in the next chapter.

Character Classes vs. Alternation

Because of the superficial similarity between [abc] and a|b|c, you might tend to think that character classes are implemented similarly, but with an NFA nothing could be further from the truth. With a DFA, it makes no difference one way or the other, but with an NFA a character class is an atomic unit which acts like a sieve, allowing a character to pass only if it is one of the target characters. This test is, in effect, done in parallel. It is much more efficient than the comparable NFA alternation, which checks each alternative in turn (entailing a lot of backtracking).

NFA, DFA, and POSIX

"The Longest-Leftmost''

Let me repeat: when the transmission starts a DFA engine from some particular point in the string, if any match is to be found from that position, the DFA will find the longest possible, period. Since it's the longest from among all possible matches that start equally furthest to the left, it's the "longest-leftmost'' match.

Really, the longest

Issues of which match is longest aren't confined to alternation. Consider how an NFA matches the (horribly contrived) one(self)?(selfsufficient)? against the string oneselfsufficient. An NFA first matches one and then the greedy (self)?, leaving (selfsufficient)? left to try against sufficient. That doesn't match, but that's okay since it is optional. So, the Traditional NFA returns oneselfsufficient and discards the untried states. (A POSIX NFA, on the other hand, is another story that we'll get to shortly.)

On the other hand, a DFA finds the longer oneselfsufficient. If the (self)? were to be non-greedy and go unmatched, the (selfsufficient)? would be able to match, yielding a longer overall match. That is the longest possible, so is the one that a DFA finds.

I chose this silly example because it's easy to talk about, but I want you to realize that this issue is important in real life. For example, consider trying to match continuation lines. It's not uncommon for a data specification to allow one logical line to extend across multiple real lines if the real lines end with a backslash before the newline. As an example, consider the following:11

SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \
        missing.c msg.c node.c re.c version.c

11  The actual text is irrelevant to the example, but for what it's worth, this is from the GNU awk makefile.

You might normally want to use ^\w+=.* to match this kind of "var = value'' assignment line, but this regex doesn't consider the continuation lines. I'm assuming, for the example, that the tool's dot won't match a newline -- you could substitute it with something like [^\n] if need be.

To match continuation lines, you might want to try appending (\\\n.*)* to the regex. Ostensibly, this says that any number of additional logical lines are allowed so long as they follow an escaped newline. This seems reasonable, but it will never work with a traditional NFA. By the time the original .* has reached the newline, it has already passed the backslash, and nothing in what was added forces it to backtrack. Yet a DFA would find the longer multi-line match if it existed, simply because it was, indeed, the longest.

POSIX and the Longest-Leftmost Rule

The POSIX standard requires that if you have multiple possible matches that start at the same position, the one matching the most text must be the one returned. Period.

The standard uses the phrase "longest of the leftmost.''12 It doesn't say you have to use a DFA, so if you want to use an NFA when creating a tool, what's a programmer to do? If you want to implement a POSIX NFA, you'd have to find the full oneselfsufficient and all the continuation lines, despite these results being "unnatural'' to your NFA.

12  The rationale associated with the standard uses the phrase "leftmost-longest,'' which is incorrect English considering what we know they're trying to say. It means "of all the equally 'longest', choose the leftmost,'' which is quite different from "longest of the leftmost.''

A Traditional NFA engine stops with the first match it finds, but what if it were to continue to try all the remaining options? Each time it reached the end of the regex, it would have another plausible match. By the time all options are exhausted, it could simply report the longest of the plausible matches it had found. Thus, a POSIX NFA.

An NFA applied to the first example would, after matching (self)?, have saved an option noting that it could pick up matching one(self)?|(selfsufficient)? at one|selfsufficient. Even after finding the oneselfsufficient that a Traditional NFA returns, a POSIX NFA would continue to exhaustively check the remaining options, eventually realizing that yes, there was a way to match the longer oneselfsufficient.

Really, the leftmost

Not only does POSIX mandate that the longest-leftmost match be found, but that if subexpression capturing (via parentheses) is supported, each captured subexpression must capture the maximum amount of text consistent with the overall leftmost-longest, and with its place in the regex. This means that the overall match is chosen based only on the longest-leftmost rule, but once chosen, the first set of capturing parentheses gets the maximum possible from that. After that, the second set gets the maximum possible of what's left. And so on.

(to|top)(o|polo)?(gical|o?logical) can match `topological' in various ways, but a POSIX engine would have to match `topological' as shown (the part matched by each parenthesized expression is marked). Compare this to, say the `topological' that a Traditional NFA would find. The former's first parentheses' top is longer than the latter's to, so it would have to be the particular match chosen for a POSIX match. Similarly, even though it could then match nothingness in the second set of parentheses (matching ological in the third), it takes the longest match that is consistent with both the longest overall match and the earlier parentheses taking their longest match.

Note, though, that many POSIX engines do not support any kind of capturing, so this issue is normally not a concern.

Speed and Efficiency

If efficiency is an issue with a Traditional NFA (and with all that backtracking, believe me, it is), it is doubly so with a POSIX NFA since there can be so much more backtracking. A POSIX NFA engine really does have to try every possible permutation of the regex every time. Examples in the next chapter show that poorly written regexes can suffer extremely severe performance penalties.

DFA efficiency

The text-directed DFA is a really fantastic way around all the inefficiency of backtracking. It gets its matching speed by keeping track of all possible ongoing matches at once. How does it achieve this magic?

The DFA engine spends extra time and memory before a match attempt to analyze the regular expression more thoroughly (and in a different way) than an NFA. Once it starts actually looking at the string, it has an internal map describing "If I read such-and-such a character now, it will be part of this-and-that possible match.'' As each character of the string is checked, the engine simply follows the map.

Building that map (called compiling the regex, something that must be done for an NFA as well, but it's not nearly as complex) can sometimes take a fair amount of time and memory, but once done for any particular regular expression, the results can be applied to an unlimited amount of text. It's sort of like charging the batteries of your electric car. First, your car sits in the garage for a while, plugged into the wall like a Dust Buster, but when you actually use it, you get consistent, clean power.

DFA and NFA in Comparison

Both DFA and NFA engines have their good and bad points:

Differences in the pre-use compile

Before applying a regex to a search, both types of engines compile the regex to an internal form suited to their respective matching algorithms. An NFA compile is generally faster, and requires less memory. There's no real difference between a Traditional and POSIX NFA compile.

Differences in match speed

For simple literal-match tests in "normal'' situations, both types match at about the same rate. A DFA's matching speed is unrelated to the particular regex, while an NFA's is directly related. For a Traditional NFA to conclude that there is no match, it must try every possible permutation of the regex. This is why I spend the entire next chapter on techniques to write an NFA regex that will match quickly. As we'll see, an NFA match can sometimes take forever (well, almost). At least a Traditional NFA can stop if and when it finds a match. A POSIX NFA, on the other hand, must always try every possible permutation to ensure that it has found the longest possible match, so it generally takes the same (possibly very long) amount of time to complete a successful match as it does completing a failed attempt. Writing efficient regexes is doubly important for a POSIX NFA.

In one sense, I speak a bit too strongly in the previous paragraph, since optimizations can often reduce the work needed to return an answer. We've already seen the optimization of not trying ^-anchored regexes beyond the start of the string (=>92), and we'll see many more in the next chapter. In general, the need for optimizations is less pressing with a DFA (since its matching is so fast to begin with), but for the most part, the extra work done during the DFA's pre-use compile affords better optimizations than most NFA engines take the trouble to do.

Modern DFA engines often try to reduce the time and memory used during the compile by postponing some work until a match is attempted. Often, much of the compile-time work goes unused because of the nature of the text actually checked. A fair amount of time and memory can sometimes be saved by postponing the work until it's actually needed during the match. (The technobabble term for this is lazy evaluation.) It does, however, create cases where there can be a relationship between the regex and the match speed.

Differences in what is matched

A DFA (or anything POSIX) finds the longest leftmost match. A Traditional NFA might also, or it might find something else. Any individual engine will always treat the same regex/text combination in the same way, so in that sense it's not "random,'' but other NFA engines may decide to do slightly different things. Virtually all NFA engines I've seen work exactly the way I've described here,13but it's not something absolutely guaranteed by technology or any standard.

13  I have seen two tools employ slightly different engines. Older versions of GNU awk (gawk), such as version 2.15.6, had neither greedy nor non-greedy alternation -- it seemed rather random what alternative would match. The other is MIFES, a popular Japanese editor. Some versions sometimes turn .*X into [^X]*X (in an effort, I suppose, to make regexes seem more "natural'' to those that don't understand them).

Differences in capabilities

An NFA engine can support many things that a DFA cannot. Among them are:

Differences in implementation ease

Although they have limitations, simple versions of DFA and NFA engines are easy enough to understand and to implement. The desire for efficiency (both in time and memory) and enhanced features drives the implementation to greater and greater complexity. With code length as a metric, consider that the NFA regex support in the Version 7 (January 1979) edition of ed was less than 350 lines of C code. (For that matter, the entire source for grep was a scant 478 lines.) Henry Spencer's 1986 freely available implementation of the Version 8 regex routines was almost 1,900 lines of C, and Tom Lord's 1992 POSIX NFA package rx (used in GNU sed, among other tools) is a stunning 9,700 lines long. For DFA implementations, the Version 7 egrep regex engine was a bit over 400 lines long, while Henry Spencer's 1992 full-featured POSIX DFA package is over 4,500 lines long. To provide the best of both worlds, GNU egrep Version 2.0 utilizes two fully functional engines (about 8,300 lines of code).

Simple, however, does not necessarily mean "lack of features.'' I recently wanted to use regular expressions for some text processing with Delphi, Borland's Pascal development environment. I hadn't used Pascal since college, but it still didn't take long to write a simple NFA regex engine. It doesn't have a lot of bells and whistles, and it is not built for maximum speed, but the flavor is relatively full-featured and so even the simple package is quite usable.

DFA Speed With NFA Capabilities: Regex Nirvana?

I've said several times that a DFA can't provide capturing parentheses or backreferences. This is quite true, but it certainly doesn't preclude hybrid approaches which mix technologies in an attempt to reach regex nirvana. The sidebar on page 104 told how NFAs have diverged from the theoretical straight and narrow in search of more power, and it's only natural that the same happens with DFAs. A DFAs construction makes it more difficult, but that doesn't mean impossible.

GNU grep takes a simple but effective approach. It uses a DFA when possible, reverting to an NFA when backreferences are used. GNU awk does something similar -- it uses GNU grep's fast shortest-leftmost DFA engine for simple "does it match'' checks, and reverts to a different engine for checks where the actual extent of the match must be known. Since that other engine is an NFA, GNU awk can conveniently offer capturing parentheses, and it does via its special gensub function.

Until recently, there seemed to be little practical work done on extending the DFA itself, but there is active research in this area. As I finish up work on this book, Henry Spencer writes that his recent and mostly DFA package supports capturing parentheses, and is "at worst quadratic in text size, while an NFA is exponential.'' This new technology needs more time to mature before it becomes widely available, but it holds promise for the future.

Practical Regex Techniques

Now that we've touched upon the basic concerns for writing regular expressions, I'd like to put this understanding to work and move on to more advanced techniques for constructing regular expressions. I'll try to pick up the pace a bit, but be aware that some of the issues are still fairly complex.

Contributing Factors

Writing a good regex involves striking a balance among several concerns:

These concerns are often context-dependent. If I'm working on the command line and just want to grep something quickly, I'll probably not care if I match a bit more than I need, and I won't usually be too concerned to craft just the right regex for it. I'll allow myself to be sloppy in the interest of time, since I can quickly peruse the output for what I want. However, when I'm working on an important script, it's worth the time and effort to get it right: a complex regular expression is okay if that's what it takes. There is a balance among all these issues.

Even in a script, efficiency is also context-dependent. For example, with an NFA, something long like ^-(display|geometry|cemap|...|quick24|random|raw)$ to check command-line arguments is inefficient because of all that alternation, but since it is only checking command-line arguments (something done perhaps a few times at the start of the program) it wouldn't matter if it took 100 times longer than needed. It's just not an important place to worry much about efficiency. Were it used to check each line of a potentially large file, the inefficiency would penalize you for the duration of the program.

Be Specific

Continuing with the continuation-line example from page 116, we found that ^\w+=.*(\\\n.*)* applied with a Traditional NFA wouldn't match both lines of:
SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \
        missing.c msg.c node.c re.c version.c

The problem is that the first .* matches past the backslash, pulling it out from under the (\\\n.*)* that we want it to be matched by. If we don't want to match past the backslash, we should say that in the regex:15

15  Notice how I made sure to include \n in the class? You'll remember that one of the assumptions of the original regex was that dot didn't match a newline, and we don't want its replacement to match a newline either (=> 79).

  ^\w+=[^\n\\]*(\\\n[^\n\\]*)*

This might be too specific, though, since it doesn't allow backslashes except those at the end of lines. You can take advantage of a regex-directed NFA's non-greedy alternation, starting over with the original expression and changing the [^\n\\]* to (\\\n|.)*, which gives us:

  ^\w+=(\\\n|.)*

The part we appended earlier, added to match the escaped newlines and their subsequent continuation lines, is now unneeded -- the main (\\\n|.)* matches right through newlines, but only if they are escaped.

Well, not exactly. A line ending with \\ (not common, but possible) happens to have a backslash before the newline, but the newline is not escaped so there is no continuation line. The problem is that the dot will match the first backslash, allowing the second to match \\\n on the next cycle of the star, resulting in a false continuation line. We weren't specific enough -- if we want that the second alternative should not match an escape, we should say that using the [^\n\\] from before.

The problem with (\\\n|[^\n\\])* is that it now doesn't allow anything to be escaped except for newlines. We really want the first alternative to say "any escaped byte.'' If dot doesn't match newline, don't make the silly mistake of trying [\n.]. If octal escapes are supported, (\\[\000-\377]|[^\n\\])* works.

Finally, a hint to what the next chapter covers: Since there's now no ambiguity between the two alternatives, we may as well swap them so that the one likely to be used most often comes first. This will allow an NFA to match a bit more quickly.

Matching an IP address

As another example that we'll take much further, let's match an IP (Internet Protocol) address: four numbers separated by periods, such as 1.2.3.4. Often, the numbers are padded to three digits, as in 001.002.003.004. If you want to check a string for one of these, you could use [0-9]*\.[0-9]*\.[0-9]*\.[0-9]*, but that's so vague that it even matches `and then.....?'. Look at the regex: it doesn't even require any numbers to exist -- its only requirements are three periods (with nothing but digits, if anything, between).

To fix this regex, we first change the star to a plus, since we know that each number must have at least one digit. To ensure that the entire string is only the IP address, we wrap the regex with ^...$. This gives us:

  ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$

Using Perl's \d as a shorthand for [0-9], this becomes the more-readable16 ^\d+\.\d+\.\d+\.\d+$, but it still allows things that aren't IP addresses, like 1234.5678.9101112.131415. (Each number must be in the range of 0-255.) To enforce that each number is three digits long, you could use

16  Or maybe not -- it depends on what you are used to. If you are new to regular expressions, at first they all seem odd. Perl has perhaps the richest regular expression flavor to be found, and lacking enough symbols to serve as metacharacters, many items such as \d combine a backslash and a letter for their representation. To some, these added "features'' are merely superficial gimmicks that add more backslashes. Personally, I don't like a lot of backslashes either, but I enjoy the features (superficial or not) and so use them.

  ^\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d$

but now we are too specific. We still need to allow one- and two-digit numbers (as in 1.2.3.4). If the tool's regex flavor provides the {min,max} notation, you can use ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$, and if not, you can always use \d\d?\d? or \d(\d\d?)? for each part. Either regex allows one to three digits, but in a slightly different way.

Depending on your needs, you might be happy with some of the various degrees of vagueness in the expressions so far (just as my editor was comfortable with the simple expression he used on page 5). If you really want to be strict, you have to worry that \d\d\d can match 999, which is above 255, and thus an invalid component of an IP address.

Several approaches would ensure that only numbers from 0 to 255 appear. One silly approach is 0|1|2|3|...253|254|255. Actually, this doesn't allow the zero-padding that is allowed, so you really need 0|00|000|1|01|001|..., which is even more ridiculous. For a DFA engine, it is ridiculous only in that it's so long and verbose -- it still matches just as fast as any regex describing the same text. For an NFA, however, all the alternation kills efficiency.

A realistic approach would concentrate on which digits are allowed in a number, and where. If a number is only one or two digits long, there is no worry as to whether the value is within range, so \d|\d\d takes care of them. There's also no worry about the value for a three-digit number beginning with a 0 or 1, since such a number is in the range 000-199. This lets us add [01]\d\d, leaving us with \d|\d\d|[01]\d\d. You might recognize this as being similar to the date example earlier in this chapter (=> 115), and the time example in Chapter 1 (=> 24).

Continuing, a three-digit number beginning with a 2 is allowed if the number is 255 or less, so a second digit less than 5 means the number is valid. If the second digit is 5, the third must be less than 6. This can all be expressed as 2[0-4]\d|25[0-5].

This may seem confusing at first, but the approach makes sense when you think about it. The result is \d|\d\d|[01]\d\d|2[0-4]\d|25[0-5]. Actually, we can combine the first three alternatives to yield [01]?\d\d?|2[0-4]\d|25[0-5]. Doing so is more efficient for an NFA, since any alternative that fails results in a backtrack. Note that using \d\d? in the first alternative, rather than \d?\d, allows an NFA to fail just a bit more quickly when there is no digit at all. I'll leave the analysis to you -- walking through a simple test case with both should illustrate the difference. We could do other things to make this part of the expression more efficient, but I'll leave that facet of the discussion for the next chapter.

Now that we have a subexpression to match a single number from 0 through 255, we can wrap it in parentheses and insert it in place of each \d{1,3} in the earlier regex. This gives us (broken across lines to fit the width of the page):

 ^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\. ([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$

Quite a mouthful! Was the trouble worth it? You have to decide yourself based upon your own needs. This still allows 0.0.0.0, which is invalid because all the digits are zero, but creating a regex to disallow this would be much tougher. You can't just disallow 0 from each number, since something like 123.202.0.188 is valid. At some point, depending on your needs, you have to decide when it is not worth the trouble to be more specific -- the cost/benefit ratio starts to suffer from diminishing returns. Sometimes it's better to take some of the work out of the regex. For example, going back to ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$ and wrapping each component in parentheses will stuff the numbers into $1, $2, $3, and $4, which can then be validated by other programming constructs.

One thing to consider, though, is the negative lookahead that Perl provides. It can disallow specific cases before the engine even tries the "main'' expression. In this case, prepending (?!0+\.0+\.0+\.0+$) causes immediate failure when every component is zero. This is explained further in Chapter 7 (=> 230).

Know your context

It's important to realize that the two anchors are required to make this regex work. Without them, it would be more than happy to match ip=72123.3.21.993, or for a Traditional NFA, even ip=123.3.21.223.

In that second case, it does not even fully match the final 223 that should have been allowed. Well, it is allowed, but there's nothing (such as a separating period, or the trailing anchor) to force that match. The final group's first alternative, [01]?\d\d?, matched the first two digits, and then that was the end of the regex. Like with the date-matching problem on page 114, we can arrange the order of the alternatives to achieve the desired effect. In this case, that would be such that the alternatives matching three digits come first, so any proper three-digit number is matched in full before the two-digit-okay alternative is given a chance.

Rearranged or not, the first mistaken match is still a problem. "Ah!'' you might think, "I can use word boundary anchors to solve this problem.'' Probably not. Such a regex could still match 1.2.3.4.5.6. To disallow embedded matches, you must ensure the surrounding context, and word boundaries are not enough. Wrapping the entire regex in (^|·)...(·|$) is one idea, but what constitutes a good solution'' depends on the situation.

Difficulties and Impossibilities

Pinpointing what to specify in a regex can sometimes be difficult when you want almost anything, as we saw with the ".*" example. Actually, in that example, we didn't really want "anything,'' but rather "anything except a doublequote,'' and so it was best to say that: "[^"]*".

Unfortunately, sometimes you can't express things so clearly. For example, if you want to allow escaped quotes within the text, such as with "he·is·6'4\"·tall", the [^"]* would never be allowed past the escaped (nor any other) doublequote, and you could match too little (only "he·is·6'4\"·tall" in this case). A larger problem is when what you don't want is more than a single character, such as with the <B>...</B> example from page 109. We'll look at these kinds of problems and their solutions in the next section.

Matching balanced sets of parentheses, brackets, and the like presents another difficulty. Wanting to match balanced parentheses is quite common when parsing many kinds of configuration files, programs, and such. Take, for example, wanting to do some processing on all of a function's arguments when parsing a language like C. Function arguments are wrapped in parentheses following the function name, and may themselves contain parentheses resulting from nested function calls or generic math grouping. At first, ignoring that they may be nested, you might be tempted to use:

  \bfoo\([^)]*\)

In hallowed C tradition, I use foo as the example function name. The marked part of the expression is ostensibly meant to match the function's arguments. With examples such as foo(2,·4.0) and foo(somevar,·3.7), it works as expected. Unfortunately, it also matches foo(bar(somevar),·3.7), which is not as we want. This calls for something a bit "smarter'' than [^)]*.

To match the parenthesized expression part you might consider the following regular expressions (among others):

1. \(.*\) literal parentheses with anything in between
2. \([^)]*\) from an opening parenthesis to the next closing parenthesis  
3. \([^()]*\) from an opening parenthesis to the next closing parenthesis, but no other opening parentheses allowed in between


Figure 4-1 illustrates where these match against a sample line of code.



Figure 4-1: Match locations of our sample regexes


We see that regex #1 matches too much,17 and regex #2 matches too little. Regex #3 doesn't even match successfully -- in isolation it would match `(this)', but because it must come immediately after the foo, it fails. So, none of these expressions works. In fact, the problem is that you simply can't match arbitrarily nested constructs with regular expressions. It just can't be done.

17  The use of .* should set off warning alarms to pay particular attention to decide whether dot is really what you want to apply star to. Sometimes that is exactly what you need, but .* is often used inappropriately.

You can construct a regular expression that matches nested constructs up to a certain depth, but not to any arbitrary level of nesting. A regular expression to allow just one level of nesting is the monstrosity \([^()]*(\([^()]*\)[^()]*)*\) (or, if you are not concerned about efficiency, or if you are using a DFA where it doesn't matter, you could use \(([^()]|\([^()]*\))*\) just as well), so the thought of having to worry about further levels of nesting is frightening.18 Sometimes, you just have to use other, non-regex methods.

18  Here's a little Perl snippet that, given a $depth, creates a regex to match up to that many levels of parentheses beyond the first:  '\(' . '([^()]|\(' x $depth . '[^()]*' . '\))*' x $depth . '\)'
Analysis left to the reader.

Watching Out for Unwanted Matches

It's very easy to forget what happens if the text is not formed just as you expect. Let's say you are writing a filter for converting a text file to HTML, and you want to replace a line of hyphens by <HR>, which represent a horizontal rule (a line across the page). If you used a s/-*/<HR>/ search-and-replace command, it would replace the sequences you wanted, but only when they're at the beginning of the line. Surprised? In fact, s/-*/<HR>/ will add <HR> to the beginning of every line, whether they begin with a sequence of hyphens or not!

Remember, anything that isn't required will always be considered successful. The first time -* is attempted at the start of the string, it matches any hyphens that are there. However, if there aren't any, it will still be happy to successfully match nothing. That's what star is all about.

Let's look at a similar example. I recently read a new book by a respected author in which he describes a regular expression to match a number, either an integer or floating-point. As his expression is constructed, such a number has an optional leading minus sign, any number of digits, an optional decimal point, and any number of digits that follow. His regex is -?[0-9]*\.?[0-9]*.

Indeed, this matches such examples as 1-272.37129238843..191919, and even something like -.0. This is all good, and as expected.

However, how do you think it will match in `this·test·has·no·number', `nothing·here', or even an empty string? Look at the regex closely -- everything is optional. If a number is there, and if it is at the beginning of the string, it will be matched, but nothing is required. This regex can match all three non-number examples, matching the nothingness at the beginning of the string each time. In fact, it even matches nothingness at the beginning of the string with an example like `num·123', since that nothingness matches earlier than the number would.

So, it's important to say what you really mean. A floating-point number must have at least one digit in it, or it is not a number(!). To construct our regex, let's first assume there is at least one digit before the decimal point. If so, we need to use plus for those digits: -?[0-9]+.

Writing the subexpression to match an optional decimal point (and subsequent digits) hinges on the realization that any numbers after the decimal point are contingent upon there being a decimal point in the first place. If we use something naïve like \.?[0-9]*, the [0-9]* gets a chance to match regardless of the decimal point's presence. The solution is, again, to say what we mean. A decimal point (and subsequent digits, if any) is optional: (\.[0-9]*)?. Here, the question mark no longer quantifies (that is, governs) only the decimal point, but instead quantifies the combination of the decimal point plus any following digits. Within that combination, the decimal point is required; if it is not there, [0-9]* never even gets a chance to match.

Putting this all together, we have -?[0-9]+(\.[0-9]*)?. This still doesn't allow something like `.007', since our regex requires at least one digit before the decimal point. If we change the left side to allow zero digits, we will have to change the right side so it doesn't, since we can't allow all digits to be optional (the problem we were trying to correct in the first place).

The solution is to add an alternative which allows for the uncovered situation: -?[0-9]+(\.[0-9]*)?|-?\.[0-9]+. This now also allows just a decimal point followed by (this time not optional) digits. Details, details. Did you notice that I allowed for the optional leading minus in the second alternative as well? That's easy to forget. Of course, you could instead bring the -? out of the alternation, as in -?([0-9]+(\.[0-9]*)?|\.[0-9]+).

Although this is an improvement on the original, you could still have problems, depending on how you use it. Often, a regex author has the context for the regex in mind, and so assumes something about the way it will be used, usually that it will be part of something larger where there are items on either side that disallow matches embedded in things that we really don't want to match. For example, our unadorned regex for floating-point numbers will be more than happy to match the string `1997.04.12'.

Yet, if we use this regex in a specific situation, such as in matching a data line where commas separate fields, wrapping our regex with ,...,, or perhaps even better (^|,)...(,|$), avoids embedding problems.

Matching Delimited Text

The IP address and "[^"]*" regexes we've examined are just two examples of a whole class of matching problem that often arises: the desire to match text delimited (or perhaps separated) by some other text. Other examples include:

In general, the requirements for such tasks can be phrased along the lines of:

  1. match the opening delimiter

  2. match the main text
    (which is really "match anything that is not the ending delimiter'')

  3. match the ending delimiter

As I mentioned earlier, the "match anything not the ending delimiter'' can become complicated when the ending delimiter is more than one character, or in situations where it can appear within the main text.

Allowing escaped quotes in doublequoted strings

Let's look at the 2\"x3\" example, where the ending delimiter is a quote, yet can appear within the main part if escaped.

The opening and closing delimiters are still the simple quotes, but matching the main text without matching the ending delimiter is the issue. Thinking clearly about which items the main text allows, we know that if a character is not a doublequote (in other words, if it's [^"]), it is certainly okay. However, if it is a doublequote, it is okay if preceded by a backslash.

Unfortunately, there is no way to indicate "if preceded'' in a regular expression. Were lookbehind possible, it would be helpful in exactly this kind of situation. Unfortunately, no one supports lookbehind yet. You can say "if followed'' simply by appending what follows, or by using lookahead if supported (such as with Perl). There is no magic here; it is merely a matter of perspective. With the same view, abc becomes "a is okay if followed by b and then by c.'' This might sound like a silly distinction, but it is quite important. It means that we can't write "A doublequote is okay if preceded by a backslash.'' but we can say "A backslash followed by a doublequote is okay.'' That's written as \\". (Remember, a backslash is a metacharacter, so if we want our regex to match a literal backslash, we have to escape the backslash itself with another backslash.)

So, for the main-text part we have "okay if [^"], or if \\"'' yielding [^"]|\\". Since this is allowed any number of times until we get to the closing-delimiter quote, we apply with star. That requires parentheses, so putting it all together with the leading and trailing quotes gives us "([^"]|\\")*".

Sound logical? It does to me, but unfortunately, it doesn't work for two reasons. The first concerns only a Traditional NFA. When applying our regex to the string "2\"x3\" likeness", we see that after the initial quote is matched, the first alternative is tried and matches the 2. Due to the match, the alternation is left, but is immediately revisited because of the star. Again, the first alternative is tried first, and again it is able to match (the backslash). Unfortunately, this is a problem because that backslash should be recognized as the escape for the following doublequote, not a random "non-quote'' of the [^"].

Continuing, the alternation is tried, but this time the first one fails since we're now at a doublequote. The second alternative, the escaped quote, fails as well since the escape is no longer available. The star therefore finishes, and the subsequent ending quote can match without problems. Therefore, we inadvertently match:

...you need a "2\"x3\" likeness" of yourself.

This is not a problem when using a DFA or POSIX engine, as they always find the longest match from any particular starting point, period. (I might have mentioned this once or twice before.) These engines realize that if the escape is matched by \\ and not [^"], the longer match (that we are expecting) is possible.

So, how do we solve the problem for a Traditional NFA? Well, if we switch the order of the two alternatives, \\" will be attempted before [^"] gets a chance to consume the escape, thus vaulting us past the escaped quote. So if we try "(\\"|[^"])*", it matches

...you need a "2\"x3\" likeness" of yourself.
as we want. The first alternative usually fails (requiring the regex engine to retry with the second alternative), so extra backtracking is required when matching with this pattern, but at least it works.

Unfortunately, I said that there were two problems. The second problem, affecting all engine types, involves the situation where we almost have something that we want to match, but not quite. Consider:

"someone has \"forgotten\" the closing quote

Since there is no proper closing quote, we do not want this line to match at all. But as we've seen, an escaped quote can be matched. With POSIX and DFA engines and the previous examples, the escaped quote wasn't the final quote, so the match didn't end there. However, in this case, the escaped quote is the final quote, so the match does end there. Even with a Traditional NFA engineered to first pass over an escaped quote, the search for an overall match causes it to backtrack to find it.

This shows a particularly important moral: Always consider what will happen in the "odd'' cases where you don't want your regex to match. For important situations, there is no substitute for really understanding what is happening, and for a comprehensive set of tests just in case.

Another moral: Make sure that there's no way that unwanted cases can sneak in through the back door. To get around our improper-match problem, we need to realize that both a doublequote and a backslash are "special'' in this context, and that they must be handled separately from everything else. This means that the original dot, which became [^"], now changes again: "(\\"|[^"\\])*".

Since [^"\\] no longer causes the first problem, we can re-swap the alternatives to put the most likely case first: "([^"\\]|\\")*". This doesn't make any difference to a DFA (which doesn't backtrack) or a POSIX NFA (which must try all permutations, whichever order they happen to be in), but it increases efficiency for a Traditional NFA.

Allowing for other escaped items

One practical problem is that our regex can't match "hello, world\n" because the only backslash it allows is one before a quote. The solution is simply to change \\" to \\., which leaves us with "([^"\\]|\\.)*".

If the regex flavor's dot does not match newlines, you have a problem if you want this regex to allow escaped newlines. If the regex flavor supports \n, you could use (.|\n) instead of the dot. A more efficient option (with tools that support it) is to use a character class that matches all bytes, such as [\000-\377]. Note that [.\n] is a character class that matches the two characters period and newline (or, in some tools, a character class that matches two characters, period and n, and in still others, the three characters period, backslash, and n).

Knowing Your Data and Making Assumptions

This is an opportune time to highlight a general point about constructing and using regular expressions that I've briefly mentioned a few times. It is important to be aware of the assumptions made about the kind of data with which, and situations in which, a regular expression is intended to be used. Even something as simple as a makes an assumption that the target data is in the same character encoding (=> 26) as the author intends. This is pretty much common sense, which is why I haven't been too picky about saying these things.

However, many assumptions that might seem obvious to one are not necessarily obvious to another. Our regex to match a doublequoted string, for example, assumes that there are no other doublequotes to consider. If you apply it to source code from almost any programming language, you might find, for instance, that it breaks because there can be doublequotes within comments.

There is nothing wrong with making assumptions about your data, or how you intend a regex to be used. The problems, if any, usually lie in overly optimistic assumptions and in misunderstandings between the author's intentions and how the regex is eventually used.

Additional Greedy Examples

There are certainly times when greediness works to your advantage. Let's look at a few simple examples. I'll use specific (and hopefully useful) applications to show general regular-expression construction techniques and thought processes.

With the thought that using is more interesting than reading, I'll sprinkle in a few examples coded in Perl, Tcl, and Python. If you're not interested in these particular languages, feel free to skip the code snippets. I'll use language-specific features, but still generally try to keep the examples simple and the lessons general.

Removing the leading path from a filename

The ability to manipulate filenames is often useful. An example is removing a leading path from a full pathname, such as turning /usr/local/bin/gcc into gcc.

Stating problems in a way that makes solutions amenable is half of the battle. In this case, we want to remove anything up to (and including) the final slash. If there is no slash, it is already fine and nothing needs to be done.

Here, we really do want to use .*. With the regex ^.*/, the .* consumes the whole line, but then backs off (that is, backtracks) to the last slash to achieve the match. Since a slash is our substitute command delimiter, we have to either escape it within the regex, as with s/.*\/// (the regular expression is marked), which could make one dizzy, or (if supported) use a different delimiter, say s!^.*/!!.

If you have a variable $filename, the following snippets ensure that there is no leading path:


Language Code Snippet
Perl $filename =~ s!^.*/!!;
Tcl regsub "^.*/" $filename "" filename
Python filename = regsub.sub("^.*/", "", filename)


By the way, if you work directly with DOS filenames, you would use a backslash instead of a slash. Since the backslash is a regular-expression metacharacter, you need to escape it (with another backslash). That would be something along the lines of s/^.*\\//. But with Tcl, Python, and other languages where regular expressions are just normal strings passed to regex-handling functions, backslash is not only a regex metacharacter, but it's often a string metacharacter as well. This means that you need \\ just to get one backslash into the regex, and since you need two in the regex, you end up needing \\\\ to match a literal backslash. Wow.

Remember this key point: Always consider what will happen if there is no match. In this case, if there is no slash in the string, no substitution is done and the string is left as is. Great, that's just what we want... a string such as `/bin/sh' becomes `sh', `//../ernst' becomes `ernst', and `vi' stays just as it is.

For efficiency's sake, it's important to remember how the regex engine goes about its work (if it is NFA-based, that is). Let's consider what happens if we omit the leading caret (something that's easy to do) and match against a string without a slash. As always, the regex engine starts the search at the beginning of the string. The .* races to the end of the string, but must back off to find a match for the slash. It eventually backs off everything that .* had gobbled up, yet there's still no match. So, the regex engine decides that there is no possible match when starting from the beginning of the string, but it's not done yet!

The transmission kicks in and retries the whole regex from the second character position. In fact, it needs (in theory) to go through the whole scan-and-backtrack routine for each possible starting position in the string. Filenames tend to be short, but the principle applies to many situations, and were the string long, that is potentially a lot of backtracking. Again, a DFA has no such problem.

In practice, a reasonable transmission realizes that any regex starting with .* that fails at the beginning of the string will never match when started from anywhere else, so it can shift gears and attempt the regex only the one time at the start of the string. Still, it's smarter to write that into our regex in the first place, as we originally did.

Accessing the filename from a path

A related option is to bypass the path and simply match the trailing filename part without the path, putting that text into another variable. The final filename is everything at the end that's not a slash: [^/]*$. This time, the anchor is not just an optimization; we really do need dollar at the end. We can now do something like:
$WholePath =~ m!([^/]*)$!; # check variable $WholePath with regex.
$FileName = $1; # note text matched

You'll notice that I don't check to see whether the regex actually matches or not, because I know it will match every time. The only requirement of that expression is that the string have an end to match dollar, and even an empty string has an end.


Thus, when I use $1 to reference the text matched within the parenthetical subexpression, I'm assured it will have some (although possibly empty) value.19

19  If you're familiar with Perl, you might wonder why I used parentheses and $1 instead of just using $& (a variable representing the overall text matched). The reason is that there's a potentially large performance hit for a program that uses $& -- see "Unsociable $& and Friends'' (=> 273).

However, since Tcl, Python, and Perl all use NFAs (Traditional NFAs, to be specific), [^\/]*$ is very inefficient. Carefully run through how the NFA engine attempts the match and you see that it can involve a lot of backtracking. Even the short sample `/usr/local/bin/perl' backtracks over 40 times before finally matching.

Consider the attempt that starts at ...|local/.... Once [^/]* matches through to the second l and fails on the slash, the $ is tried (and fails) for each l, a, c, o, l saved state. If that's not enough, most of it is repeated with the attempt that starts at ...l|ocal/..., and then again ...lo|cal/..., and so on.

It shouldn't concern us too much with this particular example, as filenames tend to be short. (And 40 backtracks is nothing -- 40 million is when they really matter!) Again, since it's important to be aware of the issues, the general lessons here can be applied to your specific needs.

This is a good time to point out that even in a book about regular expressions, regular expressions aren't always The Answer. Tcl, for example, provides special commands to pick apart pathnames (part of the file command set). In Perl,

$name = substr($WholePath, rindex($WholePath, "/")+1);
is much more efficient. However, for the sake of discussion, I'll forge ahead.

Both leading path and filename

The next logical step is to pick apart a full path into both its leading path and filename component. There are many ways to do this, depending on what we want. Initially, you might want to use ^(.*)/(.*)$ to fill $1 and $2 with the requisite parts. It looks like a nicely balanced regular expression, but knowing how greediness works, we are guaranteed that the first .* will never leave anything with a slash for $2. The only reason the first .* leaves anything at all is due to the backtracking done in trying to match the slash that follows. This leaves only that "backtracked'' part for the later .*. Thus, $1 will be the full leading path and $2 the trailing filename.

One thing to note: we are relying on the initial (.*)/ to ensure that the second (.*) does not capture any slash. We understand greediness, so this is okay. Still I like to be specific when I can, so I'd rather use [^/]* for the filename part. That gives us ^(.*)/([^/]*)$. Since it shows exactly what we want, it acts as documentation as well.

One big problem is that this regex requires at least one slash in the string, so if we try it on something like file.txt, there's no match, and no information. This can be a feature if we deal with it properly:

if ( $WholePath =~ m!^(.*)/(.*)$! ) {
    $LeadingPath = $1;
    $FileName = $2;
} else {
    $LeadingPath = ".";  # so "file.txt" looks like ". / file.txt"
    $FileName = $WholePath;
}

Another method for getting at both components is to use either method for accessing the path or file, and then use the side effects of the match to construct the other. Consider the following Tcl snippet which finds the location of the last slash, then plucks the substrings from either side:

 
if [regexp -indices .*/ $WholePath Match] {
   # We have a match. Use the index to the end of the match to find the slash.
   set LeadingPath [string range $WholePath 0 [expr [lindex $Match 1] -1]]
   set FileName [string range $WholePath [expr [lindex $Match 1] +1] end]
} {
   # No match - whole name is the filename.
   set LeadingPath .
   set FileName $WholePath
}

Here, we use Tcl's regexp -indices feature to get the index into WholePath of the match: if the string is /tmp/file.txt, the variable Match gets `0·4' to reflect that the match spanned from character 0 to character 4. We know the second index points to the slash, so we use [expr [lindex $Match 1] - 1] to point just before it, and a +1 version to point just after it. We then use string range to pluck the two substrings.

Again, with this particular example, using a regex to find the last slash is sort of silly -- some kind of an rindex function would be faster (in Tcl, it would be string last / $WholePath). Still, the idea of plucking out parts of the string this way is intriguing, even if this simple example can be done differently.

Summary

If you understood everything in this chapter the first time reading it, you probably didn't need to read it in the first place. It's heady stuff, to say the least. It took me quite a while to understand it, and then longer still to understand it. I hope this one concise presentation makes it easier for you. I've tried to keep the explanation simple without falling into the trap of oversimplification (an unfortunately all-too-common occurrence which hinders a real understanding).

This chapter is divided roughly into two parts -- a description of match mechanics, and some of their practical effects.

Match Mechanics Summary

There are two underlying technologies commonly used to implement a regex match engine, "regex-directed NFA'' (=> 99) and "text-directed DFA'' (=> 100) [abbreviations spelled out => 101].

Combine the two technologies with the POSIX standard (=> 116), and for practical purposes there are three types of engines:

  · Traditional NFA (gas-guzzling, power-on-demand) engine
  · POSIX NFA (gas-guzzling, standard-compliant) engine
  · DFA (POSIX or not) (electric, steady-as-she-goes) engine

To get the most out of a utility, you need to understand which type of engine it uses, and craft your regular expressions appropriately. The most common type is the Traditional NFA, followed by the DFA. Table 4-1 (=> 90) lists a few common tools and their engine type. In Chapter 5's "Testing the Engine Type'' (=> 160), I show how you can test for the engine type yourself.

One overriding rule regardless of engine type: matches starting sooner take precedence over matches starting later. This is due to how the engine's "transmission'' tests the regex at each point in the string (=> 92).


For the match attempt starting at any given spot:

 DFA Text-Directed Engines
Find the longest possible match, period. That's it. End of discussion (=> 115). Consistent, Very Fast, and Boring to talk about (=> 118).

 NFA Regex-Directed Engines
Must "work through'' a match. The soul of NFA matching is backtracking (=> 102, 106). The metacharacters control the match: the quantifiers (star and friends) are greedy (=> 94). Alternation is not usually greedy (=> 112), but is with a POSIX NFA.

POSIX NFA Must find the longest match, period. But it's not boring, as you must worry about efficiency (the subject of the next chapter).

Traditional NFA Can be the most expressive regex engine, since you can use the regex-directed nature of the engine to craft exactly the match you want.

"DFA and NFA in Comparison'' on page 118 summarizes the differences among the engine types.

We never did find out the regex counterpart to Riiich Coriiiiinthian Leaaaather.

Some Practical Effects of Match Mechanics

Matching delimited text, such as doublequoted strings or C comments is a common task (=> 129). The general approach is to match the opening delimiter, then anything not the closing delimiter, and finally the closing delimiter. The difficulty usually lies in ensuring that the closing delimiter doesn't sneak into the middle phase. Be sure to understand how persistent greediness can be (=> 110).

Greediness is your friend if used skillfully, but it can lead to pitfalls if you're not careful. It's a good idea to be as specific as you can (=> 122), and to pay careful attention to how unwanted matches might sneak in (=> 127).

Constructing a regex for a specific task often requires striking a balance among matching what you want, not matching what you don't want, and (for an NFA) being efficient (=> 121). For an NFA, efficiency is so crucial that I devote the next chapter to crafting an efficient NFA regular expression.

Back to: Mastering Regular Expressions


oreilly.com Home | O'Reilly Bookstores | How to Order | O'Reilly Contacts
International | About O'Reilly | Affiliated Companies | Privacy Policy

© 2001, O'Reilly & Associates, Inc.