Chapter 2. Using Regular Expressions
2.0. Introduction
Regular expressions are search patterns that can be used to find text that matches a given pattern. For instance, in the last chapter, we looked for the substring Cookbook within a longer string:
var testValue = "This is the Cookbook's test string";
var subsValue = "Cookbook";
var iValue = testValue.indexOf(subsValue); // returns value of 12, index of substring
This code snippet worked because we were looking for an exact match. But what if we wanted a more general search? For instance, what if we want to find the words Cook and Book, in that order, in strings such as “Joe’s Cooking Book” or “JavaScript Cookbook”?
When we’re looking for strings that match a pattern rather than an exact string, we need to use regular expressions. We can try to make do with String functions, but in the end, it’s actually simpler to use regular expressions, though the syntax and format are a little odd and not necessarily “user friendly.”
Recently, I was looking at code that pulled the RGB values from a string, in order to convert the color to its hexadecimal format. It’s tempting to just use the String split method and split on the commas, but then you have to strip out the parentheses and extraneous whitespace. Another consideration: how can we be sure that the values are given as integers from 0 to 255? Rather than:
rgb (255, 0, 0)
we might find:
rgb (100%, 0, 0)
There’s an additional problem: some browsers return a color, such as a background color, as an RGB value, others as a hexadecimal. You need to be able to handle both when building a consistent conversion routine.
In the end, it’s a set of regular expressions that enable us to solve what, at first, seems to be a trivial problem, but ends up being much more complicated. In an example from the popular jQuery UI library, regular expressions are used to match color values—a complicated task because the color values can take on many different formats, as this portion of the routine demonstrates:
// Look for #a0b1c2
if (result = /#([a-fA-F0-9]{2})([a-fA-F0-9]{2})([a-fA-F0-9]{2})/.exec(color))
  return [parseInt(result[1],16), parseInt(result[2],16), parseInt(result[3],16)];

// Look for #fff
if (result = /#([a-fA-F0-9])([a-fA-F0-9])([a-fA-F0-9])/.exec(color))
  return [parseInt(result[1]+result[1],16), parseInt(result[2]+result[2],16),
          parseInt(result[3]+result[3],16)];

// Look for rgba(0, 0, 0, 0) == transparent in Safari 3
if (result = /rgba\(0, 0, 0, 0\)/.exec(color))
  return colors['transparent'];

// Otherwise, we're most likely dealing with a named color
return colors[$.trim(color).toLowerCase()];
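The rgb() notation discussed earlier is not covered by the portion of the routine shown here. As a minimal sketch of how a regular expression could handle it, the following function accepts either integer or percentage values and returns a hexadecimal color. The name rgbToHex and its behavior (returning null for unrecognized input, clamping values to 255) are assumptions made for illustration, not part of jQuery UI:

```javascript
// Hypothetical helper, for illustration only: converts "rgb (255, 0, 0)"
// or "rgb (100%, 0, 0)" into "#ff0000". Returns null if the string does
// not look like an rgb() value.
function rgbToHex(color) {
  // Each component is 1-3 digits, optionally followed by a percent sign
  var re = /^rgb\s*\(\s*(\d{1,3})(%?)\s*,\s*(\d{1,3})(%?)\s*,\s*(\d{1,3})(%?)\s*\)$/;
  var m = re.exec(color);
  if (m === null) return null;

  var hex = "#";
  // Captures 1, 3, 5 hold the numbers; 2, 4, 6 hold "%" or ""
  for (var i = 1; i <= 5; i += 2) {
    var n = parseInt(m[i], 10);
    if (m[i + 1] === "%") n = Math.round(n * 255 / 100);
    n = Math.min(255, n); // clamp out-of-range values
    var h = n.toString(16);
    hex += (h.length < 2 ? "0" : "") + h;
  }
  return hex;
}

console.log(rgbToHex("rgb (255, 0, 0)"));  // #ff0000
console.log(rgbToHex("rgb (100%, 0, 0)")); // #ff0000
console.log(rgbToHex("red"));              // null
```

The optional (%?) capture after each number is what lets one pattern cover both notations without a second pass over the string.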
Though the regular expressions seem complex, they’re really nothing
more than a way to describe a pattern. In JavaScript, regular expressions
are managed through the RegExp
object.
A RegExp Literal
As with String in Chapter 1, RegExp can be both a literal and an object. To create a RegExp literal, you use the following syntax:
var re = /regular expression/;
The regular expression pattern is contained between opening and closing forward slashes. Note that this pattern is not a string: you do not want to use single or double quotes around the pattern, unless the quotes themselves are part of the pattern to match.
Regular expressions are made up of characters, either alone or in combination with special characters, that provide for more complex matching. For instance, the following is a regular expression for a pattern that matches against a string that contains the word Shelley and the word Powers, in that order, and separated by one or more whitespace characters:
var re = /Shelley\s+Powers/;
The first special character in this example is the backslash (\), which has two purposes: either it’s used with a regular character, to designate that it’s a special character; or it’s used with a special character, such as the plus sign (+), to designate that the character should be treated literally. In this case, the backslash is used with “s”, which transforms the letter s to a special character designating a whitespace character, such as a space, tab, line feed, or form feed. The \s special character is followed by the plus sign (\s+), which is a signal to match the preceding character (in this example, a whitespace character) one or more times.
This regular expression would work with the following:
Shelley Powers
It would also work with the following:
Shelley      Powers
It would not work with:
ShelleyPowers
It doesn’t matter how much whitespace is between Shelley and Powers, because of the use of \s+. However, the use of the plus sign does require at least one whitespace character.
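A quick way to confirm this behavior is to run the pattern against a few sample strings. The sample strings below are made up for this sketch:

```javascript
var re = /Shelley\s+Powers/;

// One space, mixed spaces and a tab, a line feed, and no whitespace at all
var samples = [
  "Shelley Powers",
  "Shelley \t  Powers",
  "Shelley\nPowers",
  "ShelleyPowers"
];

var results = [];
for (var i = 0; i < samples.length; i++) {
  results.push(re.test(samples[i]));
}

console.log(results.join(", ")); // true, true, true, false
```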
Table 2-1 shows the most commonly used special characters in JavaScript applications.
RegExp As Object
RegExp is a JavaScript object as well as a literal, so it can also be created using a constructor. Because the pattern is passed as a string, the backslash must itself be escaped with a second backslash:
var re = new RegExp("Shelley\\s+Powers");
When to use which? The RegExp literal is compiled when the script is evaluated, so you should use a RegExp literal when you know the expression won’t change. A compiled version is more efficient. Use the constructor when the expression changes or is going to be built or provided at runtime.
As with other JavaScript objects, RegExp has several properties and methods, the most common of which are demonstrated throughout this chapter.
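The difference between the literal and constructor forms is easiest to see by inspecting the source property of each object. The following sketch also builds a pattern at runtime; the lastName variable is a stand-in for a value that might come from user input:

```javascript
var literal = /Shelley\s+Powers/;

// In a string, the backslash must be doubled to survive as \s
var escaped = new RegExp("Shelley\\s+Powers");

// With a single backslash, the string parser turns "\s" into a plain "s"
var unescaped = new RegExp("Shelley\s+Powers");

// A pattern assembled at runtime (lastName is a hypothetical input value)
var lastName = "Powers";
var dynamic = new RegExp("Shelley\\s+" + lastName);

console.log(escaped.source);   // Shelley\s+Powers
console.log(unescaped.source); // Shelleys+Powers
console.log(dynamic.test("Shelley   Powers"));   // true
console.log(unescaped.test("Shelley Powers"));   // false
```

If the runtime value can contain regular expression special characters, it needs to be escaped before being concatenated into the pattern.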
Note
Regular expressions are powerful but can be tricky. This chapter is more an introduction to how regular expressions work in JavaScript than to regular expressions in general. If you want to learn more about regular expressions, I recommend the excellent Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan (O’Reilly).
See Also
The jQuery function shown in the first section is a conversion of a jQuery internal function incorporated into a custom jQuery plug-in. jQuery is covered in more detail in Chapter 17, and a jQuery plug-in is covered in Recipe 17.7.
2.1. Testing Whether a Substring Exists
Solution
Use a JavaScript regular expression to define a search pattern, and then apply the pattern against the string to be searched, using the RegExp test method. In the following, we want to match with any string that has the two words, Cook and Book, in that order:
var cookbookString = new Array();
cookbookString[0] = "Joe's Cooking Book";
cookbookString[1] = "Sam's Cookbook";
cookbookString[2] = "JavaScript CookBook";
cookbookString[3] = "JavaScript BookCook";
// search pattern
var pattern = /Cook.*Book/;
for (var i = 0; i < cookbookString.length; i++)
  alert(cookbookString[i] + " " + pattern.test(cookbookString[i]));
The first and third strings have a positive match, while the second and fourth do not.
Discussion
The RegExp test method takes one parameter: the string to test. It applies the regular expression against the string and returns true if there’s a match, false if there is no match.
In the example, the pattern is the word Cook appearing somewhere in the string, and the word Book appearing anywhere in the string after Cook. There can be any number of characters between the two words, including no characters, as designated in the pattern by the two regular expression characters: the period (.), and the asterisk (*).
The period in regular expressions is a special character that matches any character except the newline character. In the example pattern, the period is followed by the asterisk, which matches the preceding character zero or more times. Combined, they generate a pattern matching zero or more of any character, except newline.
In the example, the first and third strings match, because they both match the pattern of Cook and Book with anything in between. The fourth does not, because Book comes before Cook in the string. The second also doesn’t match, because the first letter of book is lowercase rather than uppercase, and the matching pattern is case-sensitive.
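The newline exception is easy to overlook. The sketch below, using made-up sample strings, shows the pattern failing when a line feed sits between the two words, and one common workaround: the character class [\s\S], which matches any character at all, including a newline:

```javascript
var pattern = /Cook.*Book/;

var sameLine = "The Cooking and Baking Book";
var twoLines = "The Cooking\nBook";

console.log(pattern.test(sameLine)); // true
console.log(pattern.test(twoLines)); // false, the period won't cross the line feed

// [\s\S] means "whitespace or non-whitespace", which is every character
var anyChar = /Cook[\s\S]*Book/;
console.log(anyChar.test(twoLines)); // true
```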
2.2. Testing for Case-Insensitive Substring Matches
Problem
You want to test whether a string is contained in another string, but you don’t care about the case of the characters in either string.
Solution
When creating the regular expression, use the ignore case flag (i):
var cookbookString = new Array();
cookbookString[0] = "Joe's Cooking Book";
cookbookString[1] = "Sam's Cookbook";
cookbookString[2] = "JavaScript CookBook";
cookbookString[3] = "JavaScript cookbook";
// search pattern
var pattern = /Cook.*Book/i;
for (var i = 0; i < cookbookString.length; i++) {
  alert(cookbookString[i] + " " + pattern.test(cookbookString[i]));
}
All four strings match the pattern.
Discussion
The solution uses a regular expression flag (i) to modify the constraints on the pattern-matching. In this case, the flag removes the constraint that the pattern-matching has to match by case. Using this flag, values of book and Book would both match.
There are only a few regular expression flags, as shown in Table 2-2. They can be used with RegExp literals:
var pattern = /Cook.*Book/i; // the 'i' is the ignore case flag
They can also be used when creating a RegExp object, via the optional second parameter:
var pattern = new RegExp("Cook.*Book","i");
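As a brief sketch, the two forms behave identically, and the flags that were set can be read back from the ignoreCase and global properties of the RegExp object. Flags can also be combined, as with "gi" below; the sample strings are invented for this example:

```javascript
var literal = /Cook.*Book/i;
var built = new RegExp("Cook.*Book", "i");
var caseSensitive = /Cook.*Book/;

var sample = "JavaScript cookbook";

console.log(literal.test(sample));       // true
console.log(built.test(sample));         // true
console.log(caseSensitive.test(sample)); // false

// The flags are exposed as read-only properties
console.log(built.ignoreCase); // true
console.log(built.global);     // false

// Combining flags: find every "cook", regardless of case
var both = new RegExp("cook", "gi");
var count = "Cook, cook, COOK".match(both).length;
console.log(count); // 3
```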
2.3. Validating a Social Security Number
Problem
You need to validate whether a text string is a valid U.S.-based Social Security number (the identifier the tax people use to find us, here in the States).
Solution
Use the String match method and a regular expression to validate that a string is a Social Security number:
var ssn = document.getElementById("pattern").value;
var pattern = /^\d{3}-\d{2}-\d{4}$/;
if (ssn.match(pattern))
  alert("OK");
else
  alert("Not OK");
Discussion
A U.S.-based Social Security number is a combination of nine digits, typically written as groups of three, two, and four digits, with or without dashes in between.
The numbers in a Social Security number can be matched with the digit special character (\d). To look for a set number of digits, you can use curly brackets surrounding the number of expected digits. In the example, the first three digits are matched with:
\d{3}
The other two groups of digits can be defined using the same criteria. Since there’s only one dash between the sequences of digits, it can be given without any special character. However, if there’s a possibility the string will have a Social Security number without the dashes, you’d want to change the regular expression pattern to:
var pattern = /^\d{3}-?\d{2}-?\d{4}$/;
The question mark special character (?) matches zero or exactly one of the preceding character (in this case, the dash). With this change, the following would match:
444-55-3333
As would the following:
555335555
But not the following, which has too many dashes:
555---60--4444
One other characteristic to check is whether the string consists of the Social Security number, and only the Social Security number. The beginning-of-input special character (^) is used to indicate that the Social Security number begins at the beginning of the string, and the end-of-input special character ($) is used to indicate that the string ends at the end of the Social Security number.
Since we’re only interested in verifying that the string is a validly formatted Social Security number, we’re using the String object’s match method. We could also have used the RegExp test method, but six of one, half dozen of the other; both approaches are acceptable.
There are other approaches to validating a Social Security number that are more complex, based on the principle that Social Security numbers can be given with spaces instead of dashes. That’s why most websites asking for a Social Security number provide three different input fields, in order to eliminate the variations. Regular expressions should not be used in place of good form design.
In addition, there is no way to actually validate that the number given is an actual Social Security number, unless you have more information about the person, and a database with all Social Security numbers. All you’re doing with the regular expression is verifying the format of the number.
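One detail of the optional-dash pattern is that each dash is optional independently, so a string with only one of the two dashes also passes. If that matters, one possible tightening (an addition for this sketch, not part of the recipe) is to capture the first separator and require the same separator again with a backreference, \1. The function name isSsnFormat is hypothetical:

```javascript
var loose = /^\d{3}-?\d{2}-?\d{4}$/;

// (-?) captures the first separator (a dash or nothing);
// \1 requires the second separator to be exactly the same
var consistent = /^\d{3}(-?)\d{2}\1\d{4}$/;

// Hypothetical helper: checks the format only, not whether the number exists
function isSsnFormat(value) {
  return consistent.test(value);
}

console.log(loose.test("444-553333"));    // true, mixed style slips through
console.log(isSsnFormat("444-553333"));   // false
console.log(isSsnFormat("444-55-3333"));  // true
console.log(isSsnFormat("444553333"));    // true
console.log(isSsnFormat("555---60--4444")); // false
```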
See Also
One site that provides some of the more complex Social Security number regular expressions, in addition to many other interesting regular expression “recipes,” is the Regular Expression Library.
2.4. Finding and Highlighting All Instances of a Pattern
Solution
Use the RegExp exec method and the global flag (g) in a loop to locate all instances of a pattern, such as any word that begins with t and ends with e, with any number of characters in between:
var searchString = "Now is the time and this is the time and that is the time";
var pattern = /t\w*e/g;
var matchArray;
var str = "";
while((matchArray = pattern.exec(searchString)) != null) {
  str+="at " + matchArray.index + " we found " + matchArray[0] + "<br />";
}
document.getElementById("results").innerHTML=str;
Discussion
The RegExp exec method executes the regular expression, returning null if a match is not found, or an array of information if a match is found. Included in the returned array is the actual matched value, the index in the string where the match is found, any parenthetical substring matches, and the original string.
index: The index of the located match
input: The original input string
[0] (or accessing the array directly): The matched value
[1],...,[n]: Parenthetical substring matches
In the solution, the index where the match was found is printed out in addition to the matched value.
The solution also uses the global flag (g). This triggers the RegExp object to preserve the location of each match, and to begin the search after the previously discovered match. When used in a loop, we can find all instances where the pattern matches the string. In the solution, the following are printed out:
at 7 we found the
at 11 we found time
at 28 we found the
at 32 we found time
at 49 we found the
at 53 we found time
Both time and the match the pattern.
Let’s look at the nature of global searching in action. In Example 2-1, a web page is created with a textarea and an input text box for accessing both a search string and a pattern. The pattern is used to create a RegExp object, which is then applied against the string. A result string is built, consisting of both the unmatched text and the matched text, except the matched text is surrounded by a span element, with a CSS class used to highlight the text. The resulting string is then inserted into the page, using the innerHTML for a div element.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Searching for strings</title>
<style type="text/css">
#searchSubmit
{
  background-color: #ff0;
  width: 200px;
  text-align: center;
  padding: 10px;
  border: 2px inset #ccc;
}
.found
{
  background-color: #ff0;
}
</style>
<script type="text/javascript">
//<![CDATA[

window.onload=function() {
  document.getElementById("searchSubmit").onclick=doSearch;
}

function doSearch() {
  // get pattern
  var pattern = document.getElementById("pattern").value;
  var re = new RegExp(pattern,"g");

  // get string
  var searchString = document.getElementById("incoming").value;

  var matchArray;
  var resultString = "<pre>";
  var first=0; var last=0;

  // find each match
  while((matchArray = re.exec(searchString)) != null) {
    last = matchArray.index;
    // get all of string up to match, concatenate
    resultString += searchString.substring(first, last);

    // add matched, with class
    resultString += "<span class='found'>" + matchArray[0] + "</span>";
    first = re.lastIndex;
  }

  // finish off string
  resultString += searchString.substring(first,searchString.length);
  resultString += "</pre>";

  // insert into page
  document.getElementById("searchResult").innerHTML = resultString;
}

//--><!]]>
</script>
</head>
<body>
<form id="textsearch">
<textarea id="incoming" cols="150" rows="10">
</textarea>
<p>
Search pattern: <input id="pattern" type="text" /></p>
</form>
<p id="searchSubmit">Search for pattern</p>
<div id="searchResult"></div>
</body>
</html>
Figure 2-1 shows the application in action on William Wordsworth’s poem, “The Kitten and the Falling Leaves,” after a search for the following pattern:
lea(f|ve)
The bar (|) denotes alternation, and will match based on the value on either side of the bar. So a word like leaf matches, as well as a word like leave, but not a word like leap.
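A short sketch makes the alternation concrete. The word list is made up for this example; note that because the pattern isn't anchored, leaves matches on its first five letters, and the capturing parentheses record which alternative was used:

```javascript
var pattern = /lea(f|ve)/;
var words = ["leaf", "leaves", "leave", "leap", "lea"];

var matched = [];
for (var i = 0; i < words.length; i++) {
  if (pattern.test(words[i])) matched.push(words[i]);
}
console.log(matched.join(", ")); // leaf, leaves, leave

// The parentheses also capture which alternative matched
var which = pattern.exec("leaves")[1];
console.log(which); // ve
```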
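You can access the position just past the most recent match through the RegExp’s lastIndex property; with the global flag, this is where the next search begins. Example 2-1 uses lastIndex, along with the index property of the match, to track where each match starts and ends.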
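To watch lastIndex advance, the following sketch collects each match along with the value of lastIndex immediately afterward. The findAll function is a hypothetical helper written for this illustration. It also includes a guard that Example 2-1 does not have: a pattern that can match an empty string (such as x*) leaves lastIndex unchanged, so the loop would never end unless lastIndex is advanced by hand:

```javascript
// Hypothetical helper: returns every match of patternText in text,
// with the start index and the lastIndex value after the match
function findAll(patternText, text) {
  var re = new RegExp(patternText, "g");
  var found = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    found.push({ index: m.index, value: m[0], next: re.lastIndex });
    // Guard against zero-length matches, which don't move lastIndex
    if (m[0] === "") re.lastIndex++;
  }
  return found;
}

var hits = findAll("t\\w*e", "the time");
for (var i = 0; i < hits.length; i++) {
  console.log(hits[i].value + " starts at " + hits[i].index +
              ", next search starts at " + hits[i].next);
}
// the starts at 0, next search starts at 3
// time starts at 4, next search starts at 8
```

Once exec returns null, the RegExp object resets lastIndex to 0, so the same object can be reused for another pass.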
See Also
Recipe 2.5 describes another way to do a standard find-and-replace behavior, and Recipe 2.6 provides a simpler approach to finding and highlighting text in a string.
2.5. Replacing Patterns with New Strings
Solution
Use the String object’s replace method, with a regular expression:
var searchString = "Now is the time, this is the time";
var re = /t\w{2}e/g;
var replacement = searchString.replace(re, "place");
alert(replacement); // Now is the place, this is the place
Discussion
In Example 2-1 in Recipe 2.4, we used the RegExp global flag (g) in order to track each occurrence of the regular expression. Each match was highlighted using a span element and CSS.
A global search is also handy for a typical find-and-replace behavior. Using the global flag (g) with the regular expression in combination with the String replace method will replace all instances of the matched text with the replacement string.
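The effect of the flag is clearest when the same replacement is run with and without it. In this sketch, which reuses the solution's string, the non-global pattern stops after the first match:

```javascript
var searchString = "Now is the time, this is the time";

// Without the global flag, only the first match is replaced
var firstOnly = searchString.replace(/t\w{2}e/, "place");

// With the global flag, every match is replaced
var all = searchString.replace(/t\w{2}e/g, "place");

console.log(firstOnly); // Now is the place, this is the time
console.log(all);       // Now is the place, this is the place
```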
See Also
Recipe 2.6 demonstrates variations of using regular expressions with the String replace method.
2.6. Swap Words in a String Using Capturing Parentheses
Problem
You want to accept an input string with first and last name, and swap the names so the last name is first.
Solution
Use capturing parentheses and a regular expression to find and remember the two names in the string, and reverse them:
var name = "Abe Lincoln";
var re = /^(\w+)\s(\w+)$/;
var newname = name.replace(re,"$2, $1");
Discussion
Capturing parentheses allow us to not only match specific patterns in a string, but to reference the matched substrings at a later time. The matched substrings are referenced numerically, from left to right, as represented by the use of “$1” and “$2” in the String replace method.
In the solution, the regular expression matches two words, separated by a space. Capturing parentheses were used with both words, so the first name is accessible using “$1”, the last name with “$2”.
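Because the pattern is anchored and expects exactly one whitespace character between two runs of word characters, input that doesn't fit is simply returned unchanged. The sketch below wraps the solution in a function, lastNameFirst, a name chosen for this illustration:

```javascript
// Hypothetical wrapper around the solution's pattern
function lastNameFirst(fullName) {
  return fullName.replace(/^(\w+)\s(\w+)$/, "$2, $1");
}

console.log(lastNameFirst("Abe Lincoln"));   // Lincoln, Abe
console.log(lastNameFirst("Cher"));          // Cher (no second word, no match)
console.log(lastNameFirst("Abe  Lincoln"));  // unchanged, two spaces don't match \s
console.log(lastNameFirst("Mary O'Brien"));  // unchanged, \w doesn't match the apostrophe
```

Loosening the pattern, for instance with \s+ between the names, is a design choice that depends on how forgiving the input handling should be.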
The capturing parentheses aren’t the only special characters available with the String replace method. Table 2-3 shows the other special characters that can be used with regular expressions and replace.
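As a quick sketch of the standard replacement patterns other than the numbered captures: $& inserts the matched substring, $$ inserts a literal dollar sign, and $` and $' insert the text before and after the match. The sample strings are invented for this example:

```javascript
// $$ is a literal dollar sign, $& is the matched text
var price = "Total: 25".replace(/\d+/, "$$$&");
console.log(price); // Total: $25

// $& alone wraps the match
var wrapped = "find me".replace(/me/, "[$&]");
console.log(wrapped); // find [me]

// $` is the text before the match, $' is the text after it
var before = "a-b".replace(/-/, "$`");
var after = "a-b".replace(/-/, "$'");
console.log(before); // aab
console.log(after);  // abb
```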
The second table entry, which reinserts the matched substring, can be used to provide a simplified version of the Example 2-1 application in Recipe 2.4. That example found and provided markup and CSS to highlight the matched substring. It used a loop to find and replace all entries, but in Example 2-2 we’ll use the String replace method with the matched substring special pattern ($&):
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Searching for strings</title>
<style>
#searchSubmit
{
  background-color: #ff0;
  width: 200px;
  text-align: center;
  padding: 10px;
  border: 2px inset #ccc;
}
.found
{
  background-color: #ff0;
}
</style>
<script>
//<![CDATA[

window.onload=function() {
  document.getElementById("searchSubmit").onclick=doSearch;
}

function doSearch() {
  // get pattern
  var pattern = document.getElementById("pattern").value;
  var re = new RegExp(pattern,"g");

  // get string
  var searchString = document.getElementById("incoming").value;

  // replace
  var resultString = searchString.replace(re,"<span class='found'>$&</span>");

  // insert into page
  document.getElementById("searchResult").innerHTML = resultString;
}

//--><!]]>
</script>
</head>
<body>
<form id="textsearch">
<textarea id="incoming" cols="100" rows="10">
</textarea>
<p>
Search pattern: <input id="pattern" type="text" /></p>
</form>
<p id="searchSubmit">Search for pattern</p>
<div id="searchResult"></div>
</body>
</html>
This is a simpler alternative, but as Figure 2-2 shows, this technique doesn’t quite preserve all aspects of the original string. The line feeds aren’t preserved with Example 2-2, but they are with Example 2-1.
The captured text can also be accessed via the RegExp object when you use the RegExp exec method. Now let’s return to the Recipe 2.6 solution code, but this time using the RegExp’s exec method:
var name = "Shelley Powers";
var re = /^(\w+)\s(\w+)$/;
var result = re.exec(name);
var newname = result[2] + ", " + result[1];
This approach is handy if you want to access the capturing parentheses values, but without having to use them within a string replacement. For another example of using capturing parentheses, recall that Recipe 1.7 demonstrated a couple of ways to access the list of items in the following sentence, using the String split method:
var sentence = "This is one sentence. This is a sentence with a list of items: cherries, oranges, apples, bananas.";
Another approach is the following, using capturing parentheses, and the RegExp exec method:
var re = /:(.*)\./;
var result = re.exec(sentence);
var list = result[1]; // " cherries, oranges, apples, bananas" (note the leading space)
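The captured text is still a single string. As a sketch of one way to finish the job, the following strips the leading whitespace and splits on commas to produce an array; the null check and the items variable are additions for this example:

```javascript
var sentence = "This is one sentence. This is a sentence with a list of items: cherries, oranges, apples, bananas.";

// Capture everything between the colon and the final period
var result = /:(.*)\./.exec(sentence);

var items = [];
if (result !== null) {
  // Drop the leading space, then split on commas and surrounding whitespace
  items = result[1].replace(/^\s+/, "").split(/\s*,\s*/);
}

console.log(items.length);      // 4
console.log(items.join(" | ")); // cherries | oranges | apples | bananas
```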
2.7. Using Regular Expressions to Trim Whitespace
Problem
Before sending a string to the server via an Ajax call, you want to trim whitespace from the beginning and end of the string.
Solution
Prior to the new ECMAScript 5 specification, you could use a regular expression to trim whitespace from the beginning and end of a string:
var testString = " this is the string ";
// trim white space from the beginning
testString = testString.replace(/^\s+/,"");
// trim white space from the end
testString = testString.replace(/\s+$/,"");
Beginning with ECMAScript 5, the String object now has a trim method:
var testString = " this is the string ";
testString = testString.trim(); // white space trimmed
Discussion
String values retrieved from form elements can sometimes have whitespace before and after the actual form value. You don’t usually want to send the string with the extraneous whitespace, so you’ll use a regular expression to trim the string.
Beginning with ECMAScript 5, there’s now a String trim method. However, until ECMAScript 5 has wider use, you’ll want to check to see if the trim method exists, and if not, use the old regular expression method as a fallback.
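The check described here can be sketched as a small wrapper function. The name trimString is hypothetical; the function uses the built-in method when it's available and the regular expressions from the solution otherwise:

```javascript
// Hypothetical wrapper: prefers the native trim, falls back to regular expressions
function trimString(str) {
  if (typeof str.trim === "function") {
    return str.trim();
  }
  return str.replace(/^\s+/, "").replace(/\s+$/, "");
}

var testString = "  \t this is the string \n ";
console.log("[" + trimString(testString) + "]"); // [this is the string]
```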
In addition, there is no left or right trim in ECMAScript 5, though there are nonstandard versions of these methods in some browsers, such as Firefox. So if you want left- or right-only trim, you’ll want to create your own functions:
function leftTrim(str) {
  return str.replace(/^\s+/,"");
}
function rightTrim(str) {
  return str.replace(/\s+$/,"");
}
2.8. Replace HTML Tags with Named Entities
Problem
You want to paste example markup into a web page, and escape the markup—have the angle brackets print out rather than have the contents parsed.
Solution
Use regular expressions to convert angle brackets (<>) into the named entities &lt; and &gt;:
var pieceOfHtml = "<p>This is a <span>paragraph</span></p>";
pieceOfHtml = pieceOfHtml.replace(/</g,"&lt;");
pieceOfHtml = pieceOfHtml.replace(/>/g,"&gt;");
document.getElementById("searchResult").innerHTML = pieceOfHtml;
Discussion
It’s not unusual to want to paste samples of markup into another web page. The only way to have the text printed out, as is, without having the browser parse it, is to convert all angle brackets into their equivalent named entities.
The process is simple with the use of regular expressions, using the regular expression global flag (g) and the String replace method, as demonstrated in the solution.
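If the sample markup can itself contain ampersands or entities, the ampersand needs to be converted too, and it has to be converted first so the ampersands introduced by the other replacements aren't escaped a second time. The following sketch, with the hypothetical function name escapeMarkup, extends the solution along those lines:

```javascript
// Hypothetical helper: escapes &, <, and > so markup displays as text
function escapeMarkup(str) {
  return str
    .replace(/&/g, "&amp;")  // must come first
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

var escapedMarkup = escapeMarkup("<p>Tom & Jerry</p>");
console.log(escapedMarkup); // &lt;p&gt;Tom &amp; Jerry&lt;/p&gt;
```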
2.9. Searching for Special Characters
Problem
We’ve searched for numbers and letters, and for anything that isn’t a number or letter, but we also need to search for the special regular expression characters themselves.
Solution
Use the backslash to escape the pattern-matching character:
var re = /\\d/;
var pattern = "\\d{4}";
var pattern2 = pattern.replace(re,"\\D");
Discussion
In the solution, a regular expression is created that matches the literal text of the special character \d, which is normally used to match any digit. The backslash is also escaped in the string that is searched, since that string holds a pattern. The digit special character is then replaced with the special character that searches for anything but a digit, \D.
Sounds a little convoluted, so I’ll demonstrate with a longer application. Example 2-3 shows a small application that first searches for a sequence of four digits in a string, and replaces them with four asterisks (****). Next, the application modifies the search pattern, replacing the \d with \D, and then runs it against the same string.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Replacement Insanity</title>
<script>
//<![CDATA[

window.onload=function() {
  // search for \d
  var re = /\\d/;
  var pattern = "\\d{4}";
  var str = "I want 1111 to find 3334 certain 5343 things 8484";
  var re2 = new RegExp(pattern,"g");
  var str1 = str.replace(re2,"****");
  alert(str1);

  var pattern2 = pattern.replace(re,"\\D");
  var re3 = new RegExp(pattern2,"g");
  var str2 = str.replace(re3, "****");
  alert(str2);
}

//--><!]]>
</script>
</head>
<body>
<p>content</p>
</body>
</html>
Here is the original string:
I want 1111 to find 3334 certain 5343 things 8484
The first string printed out is the original string with the numbers converted into asterisks:
I want **** to find **** certain **** things ****
The second string printed out is the same string, but this time each run of four non-digit characters has been converted into asterisks:
****nt 1111******** 3334******** 5343********8484
Though this example is short, it demonstrates some of the challenges when you want to search on regular expression characters themselves.
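The same challenge comes up whenever text supplied at runtime has to be matched literally. A widely used approach, sketched here with the hypothetical function name escapeRegExp, is to put a backslash in front of every special character before handing the text to the RegExp constructor:

```javascript
// Hypothetical helper: escapes regular expression special characters
// so the text is matched literally
function escapeRegExp(text) {
  // "\\$&" inserts a backslash followed by the matched character
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var literalText = "1+1=2";

// Unescaped, the plus sign means "one or more of the preceding 1"
console.log(new RegExp(literalText).test("1+1=2"));               // false
console.log(new RegExp(escapeRegExp(literalText)).test("1+1=2")); // true

// The recipe's case: finding the characters \d inside a pattern string
var findDigitClass = new RegExp(escapeRegExp("\\d"));
console.log(findDigitClass.test("\\d{4}")); // true
```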