Ruby, like many other languages, contains a powerful text-processing shortcut that looks like it was created by cats walking on the keyboard. Regular expressions can be very difficult to read, especially as they grow longer, but they offer tremendous power that’s hard to re-create in Ruby code. As long as you stay within a modest subset of regular expressions, you can get a lot done without confusing anyone—yourself included—who’s trying to make sense out of your program logic.
This excerpt is from Learning Rails . Most Rails books are written for programmers looking for information on data structures. Learning Rails targets web developers whose programming experience is tied directly to the Web. Rather than begin with the inner layers of a Rails web application -- the models and controllers -- this unique book approaches Rails development from the outer layer: the application interface. You can start from the foundations of web design you already know, and then move more deeply into Ruby, objects, and database structures.
For a much more comprehensive guide to regular expressions, see Jeffrey E. F. Friedl’s classic Mastering Regular Expressions (O’Reilly) or Tony Stubblebine’s compact but extensive Regular Expression Pocket Reference (O’Reilly).
Regular expressions help your programs find chunks of text that match patterns you specify. Depending on how you call the regular expression, you may get:
A simple answer: something matched or it didn’t
All of the pieces that matched your query, so you can sort through them
A new string with all of the replacements made, if you specified a search-and-replace operation
Regular expressions also offer incredible flexibility in specifying search terms. A key part of the reason that regular expressions look so arcane is that they use symbols to specify different kinds of matches, and matches on characters that aren’t easily typed.
The most likely place that you’re going to use regular expressions
in Rails is the validates_format_of method demonstrated
in Chapter 7, Strengthening Models with Validation, which is
shown here as Example C.1, “Validating data against regular expressions”.
Example C.1. Validating data against regular expressions
# ensure secret contains at least one number
validates_format_of :secret, :with => /[0-9]/,
  :message => "must contain at least one number"

# ensure secret contains at least one upper case
validates_format_of :secret, :with => /[A-Z]/,
  :message => "must contain at least one upper case character"

# ensure secret contains at least one lower case
validates_format_of :secret, :with => /[a-z]/,
  :message => "must contain at least one lower case character"
These samples all use regular expressions in their simplest
typical use case: testing to see whether a string contains a pattern.
Each of these will test :secret
against the expression specified by :with. If the pattern in :with matches, then validation passes. If not,
then validation fails and the :message will be returned. Removing the Rails
trim, the first of these could be stated roughly in Ruby as:
if secret =~ /[0-9]/
  # yes, it's there
else
  # no, it's not
end
(Here secret is a variable holding the string being tested.) The =~ is Ruby’s way of declaring that the test is going to compare the contents of the left operand against the regular expression on the right side. It doesn’t actually return true or false, though. It returns the numeric position at which the first match begins, if there is a match, and nil if there is none. You can treat it as a boolean evaluator, however, because nil always behaves as false in a boolean evaluation, and every other non-false value, including a match position of 0, behaves as true.
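As a rough illustration of those return values, the sketch below uses a hypothetical helper named contains_digit? (it is not part of Ruby or Rails) to turn the result of =~ into a real true or false:

```ruby
# Hypothetical helper for illustration: true when the text contains a digit.
def contains_digit?(text)
  # =~ returns an Integer position or nil; !! converts that to true or false.
  !!(text =~ /[0-9]/)
end

p("secret42" =~ /[0-9]/)           # => 6, the position of the first digit
p("secret" =~ /[0-9]/)             # => nil, because nothing matches
p(contains_digit?("0 is truthy"))  # => true, even though the match is at position 0
p(contains_digit?("secret"))       # => false
```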
There isn’t room here to explain them, but if you need to do
more with regular expressions than just testing whether there’s a
match, you’ll be interested in the $~ variable (or Regexp.last_match), which gives you access
to more detail on the results of the matching. A variety of methods on
the String object, notably sub, gsub, and slice, also use regular expressions for
slicing and dicing. You can also retrieve the text captured by parenthesized groups in the expression with $1 for the first group, $2 for the second, and so on; these variables are set by the match.
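A minimal sketch of those features follows. The date string and the helper name date_parts are made up for illustration, and the pattern uses parentheses, \d, and {n} repetition, which are covered later in this appendix:

```ruby
# Hypothetical helper: returns [year, month, day] as strings, or nil if no date is found.
def date_parts(text)
  return nil unless text =~ /(\d{4})-(\d{2})-(\d{2})/
  [$1, $2, $3]   # $1, $2, $3 hold the text captured by each pair of parentheses
end

date = "Released 2009-01-15"

if date =~ /(\d{4})-(\d{2})-(\d{2})/
  match = Regexp.last_match   # the same MatchData object as $~
  puts match[0]               # the whole match: 2009-01-15
  puts match[2]               # the second group: 01
  puts match.pre_match        # the text before the match: "Released "
end

p(date_parts(date))           # => ["2009", "01", "15"]
puts date.sub(/\d+/, "YEAR")  # replaces only the first run of digits
puts date.gsub(/\d+/, "N")    # replaces every run of digits
puts date.slice(/\d{4}/)      # returns the first four-digit run: 2009
```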
There’s one other feature in these simple examples worth a little
more depth. Reading them, you might have thought that /[0-9]/ was a regular expression. It’s a
regular expression object, but the expression itself is [0-9]. Ruby uses the forward slash as a
delimiter for regular expressions, much like quotes are used for
strings. Unlike strings, though, you can add flags after the closing
slash, as you’ll see later.
If you’d prefer, you can also use Regexp.new to create regular expression
objects. (This usually makes sense if your code needs to meet changing
circumstances on the fly at runtime.)
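For instance, a pattern built from text that is only known at runtime might look something like this sketch. The helper name pattern_for is hypothetical; Regexp.new, Regexp.escape, and Regexp::IGNORECASE are part of Ruby's core library:

```ruby
# Hypothetical helper: builds a case-insensitive pattern from a search term
# supplied at runtime.
def pattern_for(term)
  # Regexp.escape neutralizes characters such as . or [ so the term is matched literally.
  Regexp.new(Regexp.escape(term), Regexp::IGNORECASE)
end

pattern = pattern_for("ruby")
p(pattern)                           # => /ruby/i
p("I think Ruby is fun" =~ pattern)  # => 8
```

Escaping the term is a design choice for this sketch: it treats user-supplied text as a literal string rather than as an expression in its own right.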
The simplest regular expressions are simply literal strings. There are plenty of times when it’s enough to search against a fixed search pattern. For example, you might test for the presence of the string “Ruby”:
sentence = "Ruby is the best Ruby-like programming language."
sentence =~ /Ruby/ # => 0 - "Ruby" first appears at character 0.
Example C.1, “Validating data against regular expressions”
tested against letters and numbers, but there are many
ways to do that. [a-z] is a good way
to test for lowercase letters in English, but many languages use
characters outside of that range. Regular expression character classes
let you create sets of characters as well as use predefined groups of
characters to identify what you want to target.
To create your own character class, use the square braces:
[ and ]. Within the square braces, you can either
list the characters you want, or create a set of characters with the
hyphen. To match all the (guaranteed) English vowels in lowercase, you
would write:
/[aeiou]/
If you wanted to match both upper- and lowercase vowels, you could write:
/[aeiouAEIOU]/
(If you wanted to ignore case entirely in your search, you could
also use the i modifier described
later in this appendix: /[aeiou]/i.)
You can also mix character classes in with other parts of a search:
/[Rr][aeiou]by/
That would match Ruby, ruby, raby,
roby, and a lot of other variations
with upper- or lowercase R, followed
by a lowercase vowel, followed by by.
Sometimes listing all the characters in a class is a hassle. Regular expressions are difficult enough to read without huge chunks of characters in classes. So instead of:
/[abcdefghijklmnopqrstuvwxyz]/
you can just write:
/[a-z]/
As long as the characters you want to match form a single range, that’s simple—the hyphen just means “everything in between.”
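As a quick check of listed characters and ranges, this sketch filters a made-up list of words through the expressions shown above:

```ruby
# A made-up word list for illustration.
variants = %w[Ruby ruby raby roby Rxby RUBY]

# grep keeps only the elements that match the expression.
p(variants.grep(/[Rr][aeiou]by/))  # => ["Ruby", "ruby", "raby", "roby"]

# A range behaves exactly like the spelled-out list of characters.
p("abcxyz".scan(/[a-c]/))          # => ["a", "b", "c"]
p("abcxyz".scan(/[abc]/))          # => ["a", "b", "c"]
```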
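There’s also a “not” option available, in the ^ character placed immediately after the opening square brace. You can reverse /[aeiou]/, so that it matches any character that is not a lowercase vowel, by writing:
/[^aeiou]/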
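The position of the caret matters a great deal. Inside the square braces, as the first character, it negates the class; outside them it is an anchor for the start of a line. A small sketch with made-up test strings shows the difference:

```ruby
p("audio" =~ /[^aeiou]/)  # => 2, the first character that is not a lowercase vowel ("d")
p("aeiou" =~ /[^aeiou]/)  # => nil, every character is a lowercase vowel

p("audio" =~ /^[aeiou]/)  # => 0, a lowercase vowel at the start of the line
p("ruby" =~ /^[aeiou]/)   # => nil, the line starts with "r"
```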
Regular expressions also offer built-in character classes, listed in Table C.1, “Regular expression special character classes”, that can make regular expressions more readable, at least once you’ve learned what they mean.
Table C.1. Regular expression special character classes
| Syntax | Meaning |
|---|---|
| . | Matches any character. (Without the m modifier, it does not match newlines.) |
| \d | Matches any digit. (Just 0–9, not other Unicode digits.) |
| \D | Matches any nondigit. |
| \s | Matches whitespace characters: space, tab, carriage return, newline, form feed. |
| \S | Matches nonwhitespace characters. |
| \w | Matches word characters: A–Z, a–z, 0–9, and the underscore. |
| \W | Matches all nonword characters. |
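The following sketch tries the built-in classes against a made-up string. It assumes the standard Ruby spellings for these classes: the dot, \d, \D, \s, \S, \w, and \W.

```ruby
entry = "Order 66\tshipped"   # made-up sample text containing a tab

p(entry =~ /\d/)              # => 6, the first digit
p(entry =~ /\D/)              # => 0, the first nondigit
p(entry =~ /\s/)              # => 5, the space before "66"
p(entry.scan(/\w+/))          # => ["Order", "66", "shipped"]
p("snake_case" =~ /\W/)       # => nil, the underscore counts as a word character

# The dot skips newlines unless the m modifier is set.
p("a\nb" =~ /a.b/)            # => nil
p("a\nb" =~ /a.b/m)           # => 0
```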
Of course, even in simple strings there can be a large
problem: lots of characters you’ll want to test for are used by regular
expression engines with a different meaning. The square braces around
[0-9] are helpful for specifying that
it’s a set starting with zero and going to nine, but what if you’re
actually searching for square braces?
Fortunately, you can “escape” any character that regular
expressions use for something else by putting a backslash in front of
it. An expression that looks for left square brackets would look like
\[. If you need to include a
backslash, just put a second backslash in front of it, as in \\.
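A short sketch of escaping in practice follows. The sample strings are made up; Regexp.escape is a core Ruby method that adds the backslashes for you when the text to match comes from somewhere else:

```ruby
p("array[0]" =~ /\[/)    # => 5, the position of the literal left square bracket
p('C:\temp' =~ /\\/)     # => 2, the position of the literal backslash

# Regexp.escape returns a string with the special characters backslash-escaped,
# so the resulting expression matches the original text literally.
literal = Regexp.new(Regexp.escape("[0-9]"))
p("use [0-9] here" =~ literal)  # => 4
p("use 5 here" =~ literal)      # => nil, it is no longer a character class
```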
Some characters, particularly whitespace characters, are also just difficult to represent in a string without creating strange formatting. Table C.2, “Escapes for whitespace characters” shows how to escape them for convenient matching.
Table C.2. Escapes for whitespace characters
| Escape sequence | Meaning |
|---|---|
| \f | Form feed character |
| \n | Newline character |
| \r | Carriage return character |
| \t | Tab character |
Sometimes you want to be able to search for strings without regard
to case, and you don’t want to put a lot of effort into creating an
expression that covers every option. Other times you want to search
against a string that contains many lines of text, and you don’t want
the expression to stop at the first line. For these situations, where
the underlying rules change, Ruby supports modifiers, which you can put
at the end of the expression or specify through the Regexp object. A complete list of modifiers is
shown in Table C.3, “Regular expression modifier options”.
Table C.3. Regular expression modifier options
| Modifier character | Effect |
|---|---|
| i | Ignore case completely. |
| m | Multiline matching: look past the first newline, and allow . to match newline characters. |
| x | Use extended syntax, allowing whitespace and comments in expressions. (Probably not the first thing you want to try!) |
| o | Only interpolate #{} expressions the first time the regular expression is evaluated. |
| u | Treat the content of the regular expression as Unicode. (By default, it is treated as the same as the content it is tested against.) |
| e, s, n | Treat the content of the regular expression as EUC, SJIS, and ASCII, respectively, like u does for Unicode. |
Of these, i and m are the only ones you’re likely to use at
the beginning. To use them in a regular expression literal, just add
them after the closing /:
sentence = "I think Ruby is the best Ruby-like programming language."
sentence =~ /ruby/i # => 8 - "ruby" first appears at character 8.
If you want to use multiple options, you can. /ruby/iu specifies case-insensitive Unicode matching, for instance.
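To make the modifiers a little more concrete, here is a sketch using a made-up two-line string. The constant name YEAR_MONTH is hypothetical, and the x example assumes the usual Ruby behavior of ignoring whitespace and # comments inside the expression:

```ruby
text = "first line\nRuby on the second line"

p(text =~ /ruby/)        # => nil, the case doesn't match
p(text =~ /ruby/i)       # => 11, i ignores case

p(text =~ /line.Ruby/)   # => nil, the dot will not match the newline
p(text =~ /line.Ruby/m)  # => 6, m lets the dot match the newline

# x allows whitespace and comments inside the expression.
YEAR_MONTH = /
  \d{4}   # four-digit year
  -       # separator
  \d{2}   # two-digit month
/x

p("2009-01" =~ YEAR_MONTH)  # => 0
```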
Sometimes you want a match to be meaningful only at an edge: the start or the end, or maybe a word in the middle. You might even want to define your own edge, where something is important only when it’s next to something else. Ruby’s regular expression engine lets you do all of these things, as well as match only when your match is not against an edge. Table C.4, “Regular expression anchors” lists common anchor syntax.
Table C.4. Regular expression anchors
| Syntax | Meaning |
|---|---|
| ^ | When at the start of the expression, means to match the expression only against the start of the target or the start of a line within the target. (In Ruby, this applies whether or not the m modifier is set.) |
| $ | When at the end of the expression, means to match the expression only against the end of the target or the end of a line within the target. (In Ruby, this applies whether or not the m modifier is set.) |
| \A | When at the start of the expression, means to match the expression only against the start of the target string, not lines within it. |
| \Z | When at the end of the expression, means to match the expression only against the end of the target string (or just before a final newline), not lines within it. |
| \b | Marks a boundary between words: the edge between a word character and a nonword character such as whitespace or punctuation, or the start or end of the string. |
| \B | Marks something that isn’t a boundary between words. |
| (?=pattern) | Lets you define your own boundary, by limiting the match to things next to (immediately followed by) pattern. |
| (?!pattern) | Lets you define your own boundary, by limiting the match to things that are not next to (not immediately followed by) pattern. |
These make a little more sense if you see them in action. For example, if you only want to match “The” when it’s at the start of a line, you could write:
/^The/
If you wanted to match “1991” when it’s at the end of a line, you could write:
/1991$/
Because ^ and $ in Ruby also match at the start and end of each line within a string that contains several lines, if you wanted to make sure these matches apply only at the start or end of the whole string, you would write them as:
/\AThe/
/1991\Z/
The \b anchor is really useful
when you want to match a word, not places where a sequence falls in the
middle of a word. For example, if you wanted to match “the” without
matching “Athens” or “Promethean,” you could write:
/\bthe\b/
Alternately, if you wanted to match “the”
only when it was part of another word, you could
use \B to write:
/\Bthe\B/
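The sketch below runs these anchors, with no modifiers set, against a few made-up strings. The two-line string shows where the line anchors and the string anchors part ways:

```ruby
text = "1991\nThe end"      # made-up two-line sample

p(text =~ /^The/)    # => 5, "The" starts the second line
p(text =~ /\AThe/)   # => nil, "The" does not start the whole string
p(text =~ /1991$/)   # => 0, "1991" ends the first line
p(text =~ /1991\Z/)  # => nil, "1991" does not end the whole string

p("see the Athens" =~ /\bthe\b/)  # => 4, the standalone word
p("Athens" =~ /\bthe\b/)          # => nil, "the" is buried inside a word
p("Promethean" =~ /\Bthe\B/)      # => 5, "the" inside a word
p("the" =~ /\Bthe\B/)             # => nil, "the" stands alone
```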
The last two items in Table C.4, “Regular expression anchors” let you specify boundaries of your own: not just whitespace or the start or end, but any characters you want.
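Assuming those last two items are Ruby's lookahead forms, (?=pattern) and (?!pattern), a sketch with a made-up sentence might look like this. Note that the text inside the lookahead decides whether there is a match but is not itself part of the match:

```ruby
text = "Ruby on Rails, Ruby alone"   # made-up sample

# Match "Ruby" only when it is immediately followed by " on Rails".
p(text =~ /Ruby(?= on Rails)/)       # => 0
p(text[/Ruby(?= on Rails)/])         # => "Ruby", the lookahead text is not captured

# Match "Ruby" only when it is not immediately followed by " on Rails".
p(text =~ /Ruby(?! on Rails)/)       # => 15, the second "Ruby"
```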
Specifying a simple match pattern may take care of most of what you need regular expressions for in Rails, but there are a few additional pieces you should know about before moving on. Even if you never need to write a match that uses them, knowing what they look like will help you read other regular expressions when you encounter them.
There are three classic symbols that indicate whether an item is optional or can repeat, plus a notation that lets you specify how much something should repeat, as shown in Table C.5, “Options and repetition”.
Table C.5. Options and repetition
| Syntax | Meaning |
|---|---|
| ? | The pattern right before it should appear 0 or 1 times. |
| * | The pattern right before it should appear 0 or more times. |
| + | The pattern right before it should appear 1 or more times. |
| {x} | The pattern before the opening curly brace should appear exactly x times. |
| {x,} | The pattern before the opening curly brace should appear at least x times. |
| {x,y} | The pattern before the opening curly brace should appear at least x times and no more than y times. |
You might think you’re ready to go create expressions armed with this knowledge, but you’ll find some unpleasant surprises. The regular expression:
/1998+/
might look like it will match one or more instances of “1998”, but it will actually match “199” followed by one or more instances of “8”. To make it match a sequence of 1998s, you would write:
/(1998)+/
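A quick sketch makes the difference visible. It uses String#[] with a regular expression, which returns the matched text, against made-up strings:

```ruby
# Without parentheses, + applies only to the final "8".
p("19988"[/1998+/])       # => "19988"
p("19981998"[/1998+/])    # => "1998", the second "1998" is not part of the match

# With parentheses, + applies to the whole group.
p("19981998"[/(1998)+/])  # => "19981998"
```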
If you wanted to specify, say, two to five occurrences of 1998, you’d write:
/(1998){2,5}/
The parentheses can also be helpful when specifying choices,
though for a slightly different reason. If you wanted to match, say,
2013 or 2014, you could use | to
write:
/2013|2014/
The | divides the whole
expression into complete expressions to its left or right, rather than
just grabbing the previous character, so you don’t need parentheses
around either 2013 or 2014. Nonetheless, if you wanted to do something
like match 2013, 2014, or 2017, you might not want to write:
/2013|2014|2017/
You could instead write something more like:
/201(3|4|7)/
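Both forms of alternation, along with a counted repetition of a group, can be tried out in a sketch like this one (the test strings are made up):

```ruby
# The | splits the whole expression into two alternatives.
p("2014" =~ /2013|2014/)             # => 0

# Parentheses limit the alternatives to the final digit.
p("2017"[/201(3|4|7)/])              # => "2017"
p("2015" =~ /201(3|4|7)/)            # => nil, 5 is not one of the choices

# A counted repetition applies to the whole parenthesized group.
p("199819981998"[/(1998){2,5}/])     # => "199819981998"
p("1998" =~ /(1998){2,5}/)           # => nil, only one occurrence
```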
There’s one last feature of the repetition operators that can cause unexpected results: by default, they’re greedy. This isn’t a question of computing virtue, but rather one of how much content a regular expression can match at one go. This is a common issue in things like HTML, where you might see something like:
<a href="http://example.com">Example.com</a>
You might think you could match the HTML tags simply with an expression like:
/<.*>/
But instead of matching the opening tag and closing tag
separately, that expression will grab everything from the opening
< to the closing > of </a>, because it can. If you want to
restrain a given expression so that it takes the smallest possible
matching bite, add a ? behind any of
the repetition operators:
/<.*?>/
Greed matters more when you use regular expressions to extract content from long strings, but it can yield confusing results even in supposedly simple matching. If you have mysterious problems, greed is a good thing to check for.
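Greedy and non-greedy matching are easy to compare side by side. This sketch runs both expressions against the HTML fragment shown above, with the tag written without extra spaces:

```ruby
html = '<a href="http://example.com">Example.com</a>'

# Greedy: .* runs to the last > it can find.
p(html[/<.*>/])        # => "<a href=\"http://example.com\">Example.com</a>"

# Non-greedy: .*? stops at the first > it can find.
p(html[/<.*?>/])       # => "<a href=\"http://example.com\">"

# scan collects every non-greedy match, so each tag comes out separately.
p(html.scan(/<.*?>/))  # => ["<a href=\"http://example.com\">", "</a>"]
```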
Regular expressions have nearly infinite depth, and this appendix has barely begun to scratch the surface, either of expressions or the ways you can use them in Ruby and Rails. A few of the things this incredibly brief guide hasn’t been able to include are:
Using expressions to fragment a string into smaller pieces
Referencing earlier matches later in an expression
Creating named groups
Commenting regular expressions
A variety of special syntax forms using parentheses
Again, for a much more comprehensive guide to regular expressions, see Jeffrey E. F. Friedl’s classic Mastering Regular Expressions or Tony Stubblebine’s compact but extensive Regular Expression Pocket Reference. For more on using them specifically with Ruby, see The Ruby Programming Language, by David Flanagan and Yukihiro Matsumoto (O’Reilly).
Copyright © 2009 O'Reilly Media, Inc.