 Published on O'Reilly (http://oreilly.com/)


An Incredibly Brief Guide to Regular Expressions: Appendix C - Learning Rails

by Edd Dumbill, Simon St. Laurent

Ruby, like many other languages, contains a powerful text-processing shortcut that looks like it was created by cats walking on the keyboard. Regular expressions can be very difficult to read, especially as they grow longer, but they offer tremendous power that’s hard to re-create in Ruby code. As long as you stay within a modest subset of regular expressions, you can get a lot done without confusing anyone—yourself included—who’s trying to make sense out of your program logic.


This excerpt is from Learning Rails. Most Rails books are written for programmers looking for information on data structures. Learning Rails targets web developers whose programming experience is tied directly to the Web. Rather than begin with the inner layers of a Rails web application -- the models and controllers -- this unique book approaches Rails development from the outer layer: the application interface. You can start from the foundations of web design you already know, and then move more deeply into Ruby, objects, and database structures.


For a much more comprehensive guide to regular expressions, see Jeffrey E. F. Friedl’s classic Mastering Regular Expressions (O’Reilly) or Tony Stubblebine’s compact but extensive Regular Expression Pocket Reference (O’Reilly).

What Regular Expressions Do

Regular expressions help your programs find chunks of text that match patterns you specify. Depending on how you call the regular expression, you may get:

A yes/no answer

Something matched or it didn’t

A set of matches

All of the pieces that matched your query, so you can sort through them

A new string

If you specified that this was a search-and-replace operation, you may have a new string with all of the replacements made
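As a rough sketch of those three outcomes in plain Ruby (the sample text and variable names are made up for illustration, and match? assumes Ruby 2.4 or later):

```ruby
# Hypothetical sample text, used only for illustration.
text = "Rails 2.3 shipped in 2009; Rails 3.0 followed in 2010."

# A yes/no answer: does the text contain four digits in a row?
has_year = text.match?(/[0-9]{4}/)

# A set of matches: every run of four digits in the text.
years = text.scan(/[0-9]{4}/)

# A new string: each run of four digits replaced with "YYYY".
masked = text.gsub(/[0-9]{4}/, "YYYY")

puts has_year        # true
puts years.inspect   # ["2009", "2010"]
puts masked          # Rails 2.3 shipped in YYYY; Rails 3.0 followed in YYYY.
```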

Regular expressions also offer incredible flexibility in specifying search terms. A key part of the reason that regular expressions look so arcane is that they use symbols to specify different kinds of matches, and matches on characters that aren’t easily typed.

Starting Small

The most likely place that you’re going to use regular expressions in Rails is the validates_format_of method demonstrated in Chapter 7, Strengthening Models with Validation, which is shown here as Example C.1, “Validating data against regular expressions”.

Example C.1. Validating data against regular expressions

# ensure secret contains at least one number
  validates_format_of :secret, :with => /[0-9]/,
    :message => "must contain at least one number"

# ensure secret contains at least one upper case
  validates_format_of :secret, :with => /[A-Z]/,
    :message => "must contain at least one upper case character"

# ensure secret contains at least one lower case
  validates_format_of :secret, :with => /[a-z]/,
    :message => "must contain at least one lower case character"

These samples all use regular expressions in their simplest typical use case: testing to see whether a string contains a pattern. Each of these will test :secret against the expression specified by :with. If the pattern in :with matches, then validation passes. If not, then validation fails and the :message will be returned. Removing the Rails trim, the first of these could be stated roughly in Ruby as:

if secret =~ /[0-9]/
  #yes, it's there
else
  #no, it's not
end

The =~ is Ruby’s way of declaring that the test is going to compare the contents of the left operand against the regular expression on the right side. It doesn’t actually return true or false, though—it returns the numeric position at which the first match begins, if there is a match, and nil if there are none. You can treat it as a boolean evaluator, however, because nil always behaves as false in a boolean evaluation, and other non-false values are the same as true.
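One detail worth seeing in action: a match at the very start of a string returns 0, and 0 still counts as true in Ruby, because only nil and false are treated as false. A small sketch with made-up strings:

```ruby
at_start = "42 is the answer" =~ /[0-9]/   # 0: the match begins at the first character
later    = "answer: 42" =~ /[0-9]/         # 8: the match begins further in
missing  = "no digits here" =~ /[0-9]/     # nil: no match at all

[at_start, later, missing].each do |result|
  if result
    puts "matched at position #{result}"   # runs for 0 and 8
  else
    puts "no match"                        # runs only for nil
  end
end
```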

Note

There isn’t room here to explain them, but if you need to do more with regular expressions than just testing whether there’s a match, you’ll be interested in the $~ variable (or Regexp.last_match), which gives you access to more detail on the results of the matching. A variety of methods on the String object, notably sub, gsub, and slice, also use regular expressions for slicing and dicing. You can also retrieve the text captured by parenthesized groups with $1 for the first group, $2 for the second, and so on; these variables are set by the match.
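A brief sketch of those features, using a made-up date string (the variable names are illustrative only):

```ruby
date = "Released on 2009-01-15"

if date =~ /([0-9]+)-([0-9]+)-([0-9]+)/
  match = Regexp.last_match   # the same MatchData object as $~
  whole = match[0]            # "2009-01-15", the entire matched text
  year  = $1                  # "2009", the first parenthesized group
  month = match[2]            # "01", the second group (also available as $2)
end

first_only = "a-b-c".sub(/-/, "+")    # "a+b-c": sub replaces only the first match
every_one  = "a-b-c".gsub(/-/, "+")   # "a+b+c": gsub replaces every match
piece      = date.slice(/[0-9]+/)     # "2009": slice returns the first matching text
```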

There’s one other feature in these simple examples worth a little more depth. Reading them, you might have thought that /[0-9]/ was a regular expression. It’s a regular expression object, but the expression itself is [0-9]. Ruby uses the forward slash as a delimiter for regular expressions, much like quotes are used for strings. Unlike strings, though, you can add flags after the closing slash, as you’ll see later.

If you’d prefer, you can also use Regexp.new to create regular expression objects. (This usually makes sense if your code needs to meet changing circumstances on the fly at runtime.)
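As a minimal sketch of the runtime case, assume the search term arrives as a string while the program is running. The pattern_for helper is hypothetical, not part of Ruby or Rails; Regexp.new, Regexp.escape, and Regexp::IGNORECASE are standard.

```ruby
# Hypothetical helper: build a case-insensitive pattern from text known only at runtime.
def pattern_for(term)
  # Regexp.escape neutralizes characters that have special meaning in a pattern.
  Regexp.new(Regexp.escape(term), Regexp::IGNORECASE)
end

pattern = pattern_for("ruby")
puts pattern.inspect                     # /ruby/i
puts("I like Ruby" =~ pattern)           # 7
puts(Regexp.new("[0-9]") == /[0-9]/)     # true: same expression, same options
```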

The Simplest Expressions: Literal Strings

The simplest regular expressions are simply literal strings. There are plenty of times when it’s enough to search against a fixed search pattern. For example, you might test for the presence of the string “Ruby”:

sentence = "Ruby is the best Ruby-like programming language."
sentence =~ /Ruby/
# => 0  - The first "Ruby" begins at character 0.
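Because =~ reports only where the first match begins, counting matches takes a different call. One option, sketched here, is String#scan, which returns every match as an array:

```ruby
sentence = "Ruby is the best Ruby-like programming language."

first_position = sentence =~ /Ruby/      # position where the first match begins
occurrences    = sentence.scan(/Ruby/)   # array holding every match

puts first_position       # 0
puts occurrences.length   # 2
```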

Character Classes

Example C.1, “Validating data against regular expressions” tested against letters and numbers, but there are many ways to do that. [a-z] is a good way to test for lowercase letters in English, but many languages use characters outside of that range. Regular expression character classes let you create sets of characters as well as use predefined groups of characters to identify what you want to target.

To create your own character class, use the square braces: [ and ]. Within the square braces, you can either list the characters you want, or create a set of characters with the hyphen. To match all the (guaranteed) English vowels in lowercase, you would write:

/[aeiou]/

If you wanted to match both upper- and lowercase vowels, you could write:

/[aeiouAEIOU]/

(If you wanted to ignore case entirely in your search, you could also use the i modifier described later in this appendix: /[aeiou]/i.)

You can also mix character classes in with other parts of a search:

/[Rr][aeiou]by/

That would match Ruby, ruby, raby, roby, and a lot of other variations with upper- or lowercase R, followed by a lowercase vowel, followed by by.

Sometimes listing all the characters in a class is a hassle. Regular expressions are difficult enough to read without huge chunks of characters in classes. So instead of:

/[abcdefghijklmnopqrstuvwxyz]/

you can just write:

/[a-z]/

As long as the characters you want to match form a single range, that’s simple—the hyphen just means “everything in between.”

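There’s also a “not” option available, in the ^ character. When ^ is the first character inside the square braces, the class matches anything except the characters listed. You can reverse /[aeiou]/ by writing:

/[^aeiou]/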
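The classes described so far can be tried out directly. This is a small sketch with made-up sample words; note that the negating ^ sits inside the brackets, as the first character of the class.

```ruby
VOWEL     = /[aeiou]/
NOT_VOWEL = /[^aeiou]/        # ^ first inside the brackets negates the class
LOWERCASE = /[a-z]/           # the hyphen defines a range
RUBYISH   = /[Rr][aeiou]by/   # classes mixed with literal characters

p("rhythm" =~ VOWEL)                     # nil: no lowercase vowel anywhere
p("apple" =~ NOT_VOWEL)                  # 1: the first "p"
p("ABC def" =~ LOWERCASE)                # 4: the "d"
p(%w[Ruby ruby raby Rxby].grep(RUBYISH)) # ["Ruby", "ruby", "raby"]
```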

Regular expressions also offer built-in character classes, listed in Table C.1, “Regular expression special character classes”, that can make regular expressions more readable (at least, more readable once you’ve learned what they mean).

Table C.1. Regular expression special character classes

Syntax    Meaning
.         Matches any character. (Without the m modifier, it doesn’t match newlines; with the m modifier, it does.)
\d        Matches any digit. (Just 0–9, not other Unicode digits.)
\D        Matches any nondigit.
\s        Matches whitespace characters: space, tab, carriage return, newline, form feed.
\S        Matches nonwhitespace characters.
\w        Matches word characters: A–Z, a–z, 0–9, and the underscore (_).
\W        Matches all nonword characters.
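A short sketch of a few of these classes at work, on a made-up line of text. Note that \w treats the underscore as a word character, so Dock_7 comes back as a single word, and \s matches the ordinary space as well as the tab.

```ruby
line = "Order 66 shipped\tto Dock_7"

digits      = line.scan(/\d/)    # ["6", "6", "7"]
words       = line.scan(/\w+/)   # ["Order", "66", "shipped", "to", "Dock_7"]
gaps        = line.scan(/\s/)    # [" ", " ", "\t", " "]
three_chars = line[/O.d/]        # "Ord": the dot matches any single character
```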


Escaping

Of course, even in simple strings there can be a large problem: lots of characters you’ll want to test for are used by regular expression engines with a different meaning. The square braces around [0-9] are helpful for specifying that it’s a set starting with zero and going to nine, but what if you’re actually searching for square braces?

Fortunately, you can “escape” any character that regular expressions use for something else by putting a backslash in front of it. An expression that looks for left square brackets would look like \[. If you need to include a backslash, just put a second backslash in front of it, as in \\.
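A minimal sketch of escaping, using a made-up string that contains both a square bracket and a backslash. Regexp.escape is a standard Ruby method that adds the backslashes for you, which is handy when the text to match is not known in advance.

```ruby
# Single quotes keep the backslash in C:\temp as a literal backslash.
text = 'Use array[0] or the path C:\temp'

bracket_at   = text =~ /\[/   # 9: position of the literal "["
backslash_at = text =~ /\\/   # 27: position of the literal backslash

# Regexp.escape returns a string with the special characters backslashed.
escaped = Regexp.escape("[0-9]")
literal = Regexp.new(escaped)   # matches the five characters "[0-9]", not a digit
```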

Some characters, particularly whitespace characters, are also just difficult to represent in a string without creating strange formatting. Table C.2, “Escapes for whitespace characters” shows how to escape them for convenient matching.

Table C.2. Escapes for whitespace characters

Escape sequence    Meaning
\f                 Form feed character
\n                 Newline character
\r                 Carriage return character
\t                 Tab character


Modifiers

Sometimes you want to be able to search for strings without regard to case, and you don’t want to put a lot of effort into creating an expression that covers every option. Other times you want to search against a string that contains many lines of text, and you don’t want the expression to stop at the first line. For these situations, where the underlying rules change, Ruby supports modifiers, which you can put at the end of the expression or specify through the Regexp object. A complete list of modifiers is shown in Table C.3, “Regular expression modifier options”.

Table C.3. Regular expression modifier options

Modifier character    Effect
i                     Ignore case completely.
m                     Multiline matching: allow . to match newline characters, so a pattern can run past the end of a line.
x                     Use extended syntax, allowing whitespace and comments in expressions. (Probably not the first thing you want to try!)
o                     Only interpolate #{} expressions the first time the regular expression is evaluated. (Again, unlikely when starting out.)
u                     Treat the content of the regular expression as Unicode. (By default, it is treated as the same as the content it is tested against.)
e, s, n               Treat the content of the regular expression as EUC, SJIS, and ASCII, respectively, like u does for Unicode.


Of these, i and m are the only ones you’re likely to use at the beginning. To use them in a regular expression literal, just add them after the closing /:

sentence = "I think Ruby is the best Ruby-like programming language."
sentence =~ /ruby/i
# => 8  - "ruby" first appears at character 8.

If you want to use multiple options, you can. /ruby/iu specifies case-insensitive Unicode matching, for instance.
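The following sketch shows i and m side by side on a made-up two-line string, and then the same options set through the Regexp class with its standard IGNORECASE and MULTILINE constants:

```ruby
text = "first line\nRUBY on the second line"

case_sensitive   = text =~ /ruby/         # nil: the case differs
case_insensitive = text =~ /ruby/i        # 11: i ignores case
dot_stops        = text =~ /line.RUBY/    # nil: . does not match the newline
dot_crosses      = text =~ /line.RUBY/m   # 6: with m, . matches the newline
both             = text =~ /LINE.ruby/im  # 6: modifiers can be combined

# The same two options, specified through the Regexp class.
built = Regexp.new("line.ruby", Regexp::IGNORECASE | Regexp::MULTILINE)
```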

Anchors

Sometimes you want a match to be meaningful only at an edge: the start or the end, or maybe a word in the middle. You might even want to define your own edge: something is important only when it’s next to something else. Ruby’s regular expression engine lets you do all of these things, as well as match only when your match is not against an edge. Table C.4, “Regular expression anchors” lists common anchor syntax.

Table C.4. Regular expression anchors

Syntax            Meaning
^                 When at the start of the expression, means to match the expression only against the start of the target or the start of any line within it. (In Ruby, this applies with or without the m modifier.)
$                 When at the end of the expression, means to match the expression only against the end of the target or the end of any line within it. (In Ruby, this applies with or without the m modifier.)
\A                When at the start of the expression, means to match the expression only against the start of the target string, not lines within it.
\Z                When at the end of the expression, means to match the expression only against the end of the target string (or just before a trailing newline), not the ends of lines within it.
\b                Marks a word boundary: a position between a word character and a nonword character, or at the start or end of the string.
\B                Marks something that isn’t a boundary between words.
(?=expression)    Lets you define your own boundary, by limiting the match to things next to expression.
(?!expression)    Lets you define your own boundary, by limiting the match to things that are not next to expression.


These make a little more sense if you see them in action. For example, if you only want to match “The” when it’s at the start of a line, you could write:

/^The/

If you wanted to match “1991” when it’s at the end of a line, you could write:

/1991$/

If the target contains several lines, and you wanted to make sure these matches apply only at the start or end of the whole string, you would write them as:

/\AThe/
/1991\Z/

The \b anchor is really useful when you want to match a word, not places where a sequence falls in the middle of a word. For example, if you wanted to match “the” without matching “Athens” or “Promethean,” you could write:

/\bthe\b/

Alternatively, if you wanted to match “the” only when it was part of another word, you could use \B to write:

/\Bthe\B/

The last two items in Table C.4, “Regular expression anchors” let you specify boundaries of your own: not just word boundaries or the start or end, but any characters you want.
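Here is a sketch of the anchors on made-up strings. One point to keep in mind when reading it: in Ruby, ^ and $ match at the start and end of every line in the string, even without the m modifier, which is why \A and \Z exist.

```ruby
text = "Athens is old.\nThe play at the theater is over there."

line_start   = text =~ /^The/         # 15: ^ matches at the start of the second line
string_start = text =~ /\AThe/        # nil: the string itself starts with "Athens"
whole_word   = text.scan(/\bthe\b/)   # ["the"]: only the standalone word
inside_word  = text =~ /\Bthe\B/      # 1: the "the" inside "Athens"

dates      = "Born 1991\nDied 2009"
line_end   = dates =~ /1991$/         # 5: $ matches at the end of the first line
string_end = dates =~ /1991\Z/        # nil: the string ends with "2009"

# Lookahead: "the" only when followed by "re" (the "re" is not part of the match).
before_re     = text.scan(/the(?=re)/).length   # 1: the one in "there"
# Negative lookahead: "the" when not followed by "re".
not_before_re = text.scan(/the(?!re)/).length   # 3: "Athens", "the", "theater"
```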

Sequences, Repetition, Groups, and Choices

Specifying a simple match pattern may take care of most of what you need regular expressions for in Rails, but there are a few additional pieces you should know about before moving on. Even if you never write a pattern that needs these, knowing what they look like will help you read other regular expressions when you encounter them.

There are three classic symbols that indicate whether an item is optional or can repeat, plus a notation that lets you specify how much something should repeat, as shown in Table C.5, “Options and repetition”.

Table C.5. Options and repetition

Syntax               Meaning
?                    The pattern right before it should appear 0 or 1 times.
*                    The pattern right before it should appear 0 or more times.
+                    The pattern right before it should appear 1 or more times.
{number}             The pattern before the opening curly brace should appear exactly number times.
{number,}            The pattern before the opening curly brace should appear at least number times.
{number1,number2}    The pattern before the opening curly brace should appear at least number1 times but no more than number2 times. (Don’t put a space after the comma.)


You might think you’re ready to go create expressions armed with this knowledge, but you’ll find some unpleasant surprises. The regular expression:

/1998+/

might look like it will match one or more instances of “1998”, but it will actually match “199” followed by one or more instances of “8”. To make it match a sequence of 1998s, you would write:

/(1998)+/

If you wanted to specify, say, two to five occurrences of 1998, you’d write:

/(1998){2,5}/

The parentheses can also be helpful when specifying choices, though for a slightly different reason. If you wanted to match, say, 2013 or 2014, you could use | to write:

/2013|2014/

The | divides the whole expression into complete expressions to its left or right, rather than just grabbing the previous character, so you don’t need parentheses around either 2013 or 2014. Nonetheless, if you wanted to do something like match 2013, 2014, or 2017, you might not want to write:

/2013|2014|2017/

You could instead write something more like:

/201(3|4|7)/
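These repetition and choice rules can be checked with a few made-up strings. In this sketch, String#[] with a regular expression returns the text that matched, and grep keeps the array elements that match; the color/colour pattern is an extra illustration of ? that doesn’t appear above.

```ruby
plus_on_last_char = "199888"[/1998+/]        # "199888": + applies only to the 8
plus_on_group     = "19981998"[/(1998)+/]    # "19981998": + applies to the whole group
two_to_five       = "1998" =~ /(1998){2,5}/  # nil: only one occurrence
optional          = %w[color colour].grep(/colou?r/)   # both match: the u is optional

years      = %w[2013 2014 2015 2017]
long_form  = years.grep(/2013|2014|2017/)    # ["2013", "2014", "2017"]
short_form = years.grep(/201(3|4|7)/)        # the same three years
```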

Note

Parentheses also “capture” matched text for later use, and that capturing may determine how you structure parentheses. It’s probably not the first place you’ll want to start, though.

Greed

There’s one last feature of the repetition operators that can cause unexpected results: by default, they’re greedy. This isn’t a question of computing virtue, but rather one of how much content a regular expression can match at one go. This is a common issue in things like HTML, where you might see something like:

<a href="http://example.com">Example.com</a>

You might think you could match the HTML tags simply with an expression like:

/<.*>/

But instead of matching the opening tag and closing tag separately, that expression will grab everything from the opening < to the closing > of </a>, because it can. If you want to restrain a given expression so that it takes the smallest possible matching bite, add a ? after any of the repetition operators:

/<.*?>/

Greed matters more when you use regular expressions to extract content from long strings, but it can yield confusing results even in supposedly simple matching. If you have mysterious problems, greed is a good thing to check for.
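To see the difference, here is a minimal sketch that runs both expressions against the HTML fragment above, using String#scan to collect every match:

```ruby
html = '<a href="http://example.com">Example.com</a>'

greedy = html.scan(/<.*>/)    # one match: the entire string
lazy   = html.scan(/<.*?>/)   # two matches: the opening tag and the closing tag

puts greedy.length   # 1
puts lazy.inspect    # ["<a href=\"http://example.com\">", "</a>"]
```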

More

Regular expressions have nearly infinite depth, and this appendix has barely begun to scratch the surface, either of expressions or the ways you can use them in Ruby and Rails. A few of the things this incredibly brief guide hasn’t been able to include are:

  • Using expressions to fragment a string into smaller pieces

  • Referencing earlier matches later in an expression

  • Creating named groups

  • Commenting regular expressions

  • A variety of special syntax forms using parentheses

Again, for a much more comprehensive guide to regular expressions, see Jeffrey E. F. Friedl’s classic Mastering Regular Expressions or Tony Stubblebine’s compact but extensive Regular Expression Pocket Reference. For more on using them specifically with Ruby, see The Ruby Programming Language, by David Flanagan and Yukihiro Matsumoto (O’Reilly).

If you enjoyed this excerpt, buy a copy of Learning Rails.

Copyright © 2009 O'Reilly Media, Inc.