You want to determine whether a user entered a North American phone number, including the local area code, in a common format. These formats include 1234567890, 123-456-7890, 123.456.7890, 123 456 7890, (123) 456 7890, and all related combinations. If the phone number is valid, you want to convert it to your standard format, (123) 456-7890, so that your phone number records are consistent.
A regular expression can easily check whether a user entered something that looks like a valid phone number. By using capturing groups to remember each set of digits, the same regular expression can be used to replace the subject text with precisely the format you want.
^\(?([0-9]{3})\)?[-.●]?([0-9]{3})[-.●]?([0-9]{4})$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
($1)●$2-$3
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
(\1)●\2-\3
Replacement text flavors: Python, Ruby
Regex phoneRegex =
    new Regex(@"^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$");

if (phoneRegex.IsMatch(subjectString)) {
    string formattedPhoneNumber =
        phoneRegex.Replace(subjectString, "($1) $2-$3");
} else {
    // Invalid phone number
}

var phoneRegex = /^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/;

if (phoneRegex.test(subjectString)) {
    var formattedPhoneNumber =
        subjectString.replace(phoneRegex, "($1) $2-$3");
} else {
    // Invalid phone number
}
If you need help converting the examples just listed to your programming language of choice, Recipe 3.6 shows how to implement the test of whether a regex matches the entire subject, and Recipe 3.15 has code listings for performing a replacement that reuses parts of a match (done here to reformat the phone number).
This regular expression matches three groups of digits. The first group can optionally be enclosed with parentheses, and the first two groups can optionally be followed with a choice of three separators (a hyphen, dot, or space). The following layout breaks the regular expression into its individual parts, omitting the redundant groups of digits:
^         # Assert position at the beginning of the string.
\(        # Match a literal "("
  ?       #   between zero and one time.
(         # Capture the enclosed match to backreference 1:
  [0-9]   #   Match a digit
    {3}   #     exactly three times.
)         # End capturing group 1.
\)        # Match a literal ")"
  ?       #   between zero and one time.
[-.●]     # Match one hyphen, dot, or space
  ?       #   between zero and one time.
⋯         # [Match the remaining digits and separator.]
$         # Assert position at the end of the string.
Let’s look at each of these parts more closely.
The ‹^› and ‹$› at the beginning and end of the regular expression are a special kind of metacharacter called an anchor or assertion. Instead of matching text, assertions match a position within the text. Specifically, ‹^› matches at the beginning of the text, and ‹$› at the end. This ensures that the phone number regex does not match within longer text, such as 123-456-78901.
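To see what the anchors contribute, the same pattern can be tried with and without them. This is only an illustrative sketch; the variable names are made up for the example and do not come from the recipe.

```javascript
// Same pattern, with and without the ^ and $ anchors.
const anchored = /^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/;
const unanchored = /\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})/;

const tooLong = "123-456-78901"; // one digit too many

// The anchored regex must account for the whole string, so it fails.
const anchoredResult = anchored.test(tooLong);

// Without anchors, the regex finds a 10-digit match inside the longer text.
const unanchoredResult = unanchored.test(tooLong);

console.log(anchoredResult, unanchoredResult); // false true
```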
As we’ve repeatedly seen, parentheses are special characters in regular expressions, but in this case we want to allow a user to enter parentheses and have our regex recognize them. This is a textbook example of where we need a backslash to escape a special character so the regular expression treats it as literal input. Thus, the ‹\(› and ‹\)› sequences that enclose the first group of digits match literal parenthesis characters. Both are followed by a question mark, which makes them optional. We’ll explain more about the question mark after discussing the other types of tokens in this regular expression.
The parentheses that appear without backslashes are capturing groups and are used to remember the values matched within them so that the matched text can be recalled later. In this case, backreferences to the captured values are used in the replacement text so we can easily reformat the phone number as needed.
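As a rough illustration of what the capturing groups remember, the following sketch reads the captured values directly and then reuses them in replacement text. The variable names and the sample input are assumptions chosen for the example.

```javascript
const phoneRegex = /^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/;
const subject = "123.456.7890";

// exec() returns an array: index 0 is the overall match,
// and indexes 1 to 3 hold the text captured by groups 1 to 3.
const match = phoneRegex.exec(subject);
const areaCode = match[1];
const exchange = match[2];
const station = match[3];

// The same captured values are available as $1, $2, $3 in replacement text.
const formatted = subject.replace(phoneRegex, "($1) $2-$3");

console.log(areaCode, exchange, station); // 123 456 7890
console.log(formatted);                   // (123) 456-7890
```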
Two other types of tokens used in this regular expression are character classes and quantifiers. Character classes allow you to match any one out of a set of characters. ‹[0-9]› is a character class that matches any digit. The regular expression flavors covered by this book all include the shorthand character class ‹\d› that also matches a digit, but in some flavors ‹\d› matches a digit from any language’s character set or script, which is not what we want here. See Recipe 2.3 for more information about ‹\d›.
‹[-.●]› is another character class, one that allows any one of three separators. It’s important that the hyphen appears first or last in this character class, because if it appeared between other characters, it would create a range, as with ‹[0-9]›. Another way to ensure that a hyphen inside a character class matches a literal version of itself is to escape it with a backslash. ‹[.\-●]› is therefore equivalent. The ‹●› represents a literal space character.
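The effect of hyphen placement can be checked with a small experiment. The sketch below compares the two equivalent classes with a hypothetical, deliberately wrong ordering in which the hyphen sits between the space and the dot and therefore forms a range.

```javascript
const hyphenFirst = /^[-. ]$/;     // hyphen is literal: "-", ".", or " "
const hyphenEscaped = /^[.\- ]$/;  // escaped hyphen: the same three characters
const accidentalRange = /^[ -.]$/; // range from space (U+0020) to dot (U+002E)

// Both correct classes accept each of the three separators.
const separators = ["-", ".", " "];
const bothAccept = separators.every(
  (s) => hyphenFirst.test(s) && hyphenEscaped.test(s)
);

// "+" (U+002B) lies inside the accidental range, so it slips through there.
const plusWithHyphenFirst = hyphenFirst.test("+");
const plusWithRange = accidentalRange.test("+");

console.log(bothAccept, plusWithHyphenFirst, plusWithRange); // true false true
```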
Finally, quantifiers allow you to repeatedly match a token or group. ‹{3}› is a quantifier that causes its preceding element to be matched exactly three times. The regular expression ‹[0-9]{3}› is therefore equivalent to ‹[0-9][0-9][0-9]›, but is shorter and hopefully easier to read. A question mark (mentioned earlier) is a quantifier that causes its preceding element to match zero or one time. It could also be written as ‹{0,1}›. Any quantifier that allows something to match zero times effectively makes that element optional. Since a question mark is used after each separator, the phone number digits are allowed to run together.
Tip
Note that although this recipe claims to handle North American phone numbers, it’s actually designed to work with North American Numbering Plan (NANP) numbers. The NANP is the telephone numbering plan for the countries that share the country code “1.” This includes the United States and its territories, Canada, Bermuda, and 17 Caribbean nations. It excludes Mexico and the Central American nations.
So far, the regular expression matches any 10-digit number. If you want to limit matches to valid phone numbers according to the North American Numbering Plan, here are the basic rules:
Area codes start with a number 2–9, followed by 0–8, and then any third digit.
The second group of three digits, known as the central office or exchange code, starts with a number 2–9, followed by any two digits.
The final four digits, known as the station code, have no restrictions.
These rules can easily be implemented with a few character classes.
^\(?([2-9][0-8][0-9])\)?[-.●]?([2-9][0-9]{2})[-.●]?([0-9]{4})$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
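A brief check of this stricter pattern against a few inputs might look like the following sketch. The sample numbers are invented for illustration, and each one is chosen to exercise one of the rules listed above.

```javascript
const nanpRegex =
  /^\(?([2-9][0-8][0-9])\)?[-. ]?([2-9][0-9]{2})[-. ]?([0-9]{4})$/;

const samples = [
  "(212) 555-0142", // satisfies every rule
  "123-456-7890",   // area code starts with 1
  "290-555-0142",   // area code has 9 as its second digit
  "212-155-0142",   // exchange code starts with 1
];

// Test each sample against the stricter pattern.
const results = samples.map((s) => nanpRegex.test(s));

console.log(results); // [ true, false, false, false ]
```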
Beyond the basic rules just listed, there are a variety of reserved, unassigned, and restricted phone numbers. Unless you have very specific needs that require you to filter out as many phone numbers as possible, don’t go overboard trying to eliminate unused numbers. New area codes that fit the rules listed earlier are made available regularly, and even if a phone number is valid, that doesn’t necessarily mean it was issued or is in active use.
Two simple changes allow the previous regular expressions to match phone numbers within longer text:
\(?\b([0-9]{3})\)?[-.●]?([0-9]{3})[-.●]?([0-9]{4})\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Here, the ‹^› and ‹$› assertions that bound the regular expression to the beginning and end of the text have been removed. In their place, word boundary tokens (‹\b›) have been added to ensure that the matched text stands on its own and is not part of a longer number or word.
Similar to ‹^› and ‹$›, ‹\b› is an assertion that matches a position rather than any actual text. Specifically, ‹\b› matches the position between a word character and either a nonword character or the beginning or end of the text. Letters, numbers, and underscore are all considered word characters (see Recipe 2.6).
Note that the first word boundary token appears after the optional, opening parenthesis. This is important because there is no word boundary to be matched between two nonword characters, such as the opening parenthesis and a preceding space character. The first word boundary is relevant only when matching a number without parentheses, since the word boundary always matches between the opening parenthesis and the first digit of a phone number.
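One possible way to apply the word-boundary version is to add JavaScript's global flag and collect or reformat every phone number in a block of text. In the sketch below, the `/g` flag, the sample sentence, and the variable names are assumptions for illustration rather than part of the recipe.

```javascript
// The /g flag makes match() and replace() process every match, not just the first.
const inTextRegex = /\(?\b([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})\b/g;

const text =
  "Call (123) 456-7890 or 987.654.3210; order no. 12345678901 is not a phone number.";

// The 11-digit order number is skipped: no word boundary follows its tenth digit.
const found = text.match(inTextRegex);

// Reformat every phone number in place, leaving the rest of the text alone.
const normalized = text.replace(inTextRegex, "($1) $2-$3");

console.log(found); // [ '(123) 456-7890', '987.654.3210' ]
console.log(normalized);
```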
You can allow an optional, leading “1” for the country code (which covers the North American Numbering Plan region) via the addition shown in the following regex:
^(?:\+?1[-.●]?)?\(?([0-9]{3})\)?[-.●]?([0-9]{3})[-.●]?([0-9]{4})$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In addition to the phone number formats shown previously, this regular expression will also match strings such as +1 (123) 456-7890 and 1-123-456-7890. It uses a noncapturing group, written as ‹(?:⋯)›. When a question mark follows an unescaped left parenthesis like this, it’s not a quantifier, but instead helps to identify the type of grouping. Standard capturing groups require the regular expression engine to keep track of backreferences, so it’s more efficient to use noncapturing groups whenever the text matched by a group does not need to be referenced later. Another reason to use a noncapturing group here is to allow you to keep using the same replacement string as in the previous examples. If we added a capturing group, we’d have to change $1 to $2 (and so on) in the replacement text shown earlier in this recipe.
The full addition to this version of the regex is ‹(?:\+?1[-.●]?)?›. The “1” in this pattern is preceded by an optional plus sign, and optionally followed by one of three separators (hyphen, dot, or space). The entire, added noncapturing group is also optional, but since the “1” is required within the group, the preceding plus sign and separator are not allowed if there is no leading “1.”
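The sketch below, with invented sample inputs, illustrates both points: the replacement text from earlier in the recipe still works unchanged because the new group is noncapturing, and a plus sign without a following “1” is rejected.

```javascript
const withCountryCode =
  /^(?:\+?1[-. ]?)?\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/;

const inputs = [
  "+1 (123) 456-7890", // plus sign, country code, and separator
  "1-123-456-7890",    // country code and separator only
  "123-456-7890",      // no country code
  "+(123) 456-7890",   // plus sign without the required "1"
];

// The noncapturing group leaves $1, $2, $3 pointing at the same digits as before.
const formattedNumbers = inputs.map((s) =>
  withCountryCode.test(s) ? s.replace(withCountryCode, "($1) $2-$3") : null
);

console.log(formattedNumbers);
// [ '(123) 456-7890', '(123) 456-7890', '(123) 456-7890', null ]
```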
To allow matching phone numbers that omit the local area code, enclose the first group of digits together with its surrounding parentheses and following separator in an optional, noncapturing group:
^(?:\(?([0-9]{3})\)?[-.●]?)?([0-9]{3})[-.●]?([0-9]{4})$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Since the area code is no longer required as part of the match, simply replacing any match with «($1)●$2-$3» might now result in something like () 123-4567, with an empty set of parentheses. To work around this, add code outside the regex that checks whether group 1 matched any text, and adjust the replacement text accordingly.
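One way to perform that check in JavaScript is sketched below. The helper name `formatPhone` and the choice to return `null` for invalid input are assumptions made for this example; the recipe does not prescribe them.

```javascript
const optionalAreaCode =
  /^(?:\(?([0-9]{3})\)?[-. ]?)?([0-9]{3})[-. ]?([0-9]{4})$/;

// Hypothetical helper: returns the normalized number, or null if the input is invalid.
function formatPhone(input) {
  const match = optionalAreaCode.exec(input);
  if (match === null) {
    return null;
  }
  const [, areaCode, exchange, station] = match;
  // In JavaScript, a group that did not participate in the match is undefined.
  return areaCode === undefined
    ? `${exchange}-${station}`
    : `(${areaCode}) ${exchange}-${station}`;
}

console.log(formatPhone("(123) 456 7890")); // (123) 456-7890
console.log(formatPhone("123 4567"));       // 123-4567
console.log(formatPhone("12345"));          // null
```

Building the result in code rather than with a single replacement string avoids the empty parentheses that «($1)●$2-$3» would produce for a seven-digit number.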
Recipe 4.3 shows how to validate international phone numbers.
As noted previously, the North American Numbering Plan (NANP) is the telephone numbering plan for the United States and its territories, Canada, Bermuda, and 17 Caribbean nations. More information is available at http://www.nanpa.com.
Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.6 explains word boundaries. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.