This chapter contains recipes for validating and formatting common types of user input. Some of the solutions show how to allow variations of valid input, such as U.S. postal codes that can contain either five or nine digits. Others are designed to harmonize or fix commonly understood formats for things such as phone numbers, dates, and credit card numbers.
Beyond helping you get the job done by eliminating invalid input, these recipes can also improve the user experience of your applications. Messages such as “no spaces or hyphens” next to phone or credit card number fields often frustrate users or are simply ignored. Fortunately, in many cases regular expressions allow you to let users enter data in formats that they find familiar and comfortable with very little extra work on your part.
Certain programming languages provide functionality similar to some recipes in this chapter through their native classes or libraries. Depending on your needs, it might make more sense to use these built-in options, so we’ll point them out along the way.
You have a form on your website or a dialog box in your application that asks the user for an email address. You want to use a regular expression to validate this email address before trying to send email to it. This reduces the number of emails returned to you as undeliverable.
The first solution does a very simple check. It only verifies that the email address contains an at (@) sign, with at least one character on each side of it, and no whitespace:
^\S+@\S+$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A\S+@\S+\Z
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
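As a rough illustration of how this simplest check behaves, here is a minimal Python sketch. The function name `looks_like_email` is made up for this example; the pattern is the `\A...\Z` variant shown above.

```python
import re

# The simplest pattern from this recipe: non-whitespace, an @ sign, non-whitespace.
SIMPLE_EMAIL = re.compile(r"\A\S+@\S+\Z")

def looks_like_email(text: str) -> bool:
    """Return True if the whole string matches the simple pattern."""
    return SIMPLE_EMAIL.search(text) is not None

print(looks_like_email("john.doe@somewhere.com"))  # True
print(looks_like_email("john doe@somewhere.com"))  # False: contains a space
print(looks_like_email("a@b@c"))                   # True: \S also matches @
```

Because `\S` matches any nonwhitespace character, including the @ sign itself, this check accepts input with more than one @ sign.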
The domain name, the part after the @ sign, is restricted to characters allowed in domain names. The username, the part before the @ sign, is restricted to characters commonly used in email usernames, which is more restrictive than what most email clients and servers will accept:
^[A-Z0-9+_.-]+@[A-Z0-9.-]+$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A[A-Z0-9+_.-]+@[A-Z0-9.-]+\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
This regular expression expands the previous one by allowing a larger set of rarely used characters in the username. Not all email software can handle all these characters, but we’ve included all the characters permitted by RFC 2822, which governs the email message format. Among the permitted characters are some that present a security risk if passed directly from user input to an SQL statement, such as the single quote (') and the pipe character (|). Be sure to escape sensitive characters when inserting the email address into a string passed to another program, in order to prevent security holes such as SQL injection attacks:
^[\w!#$%&'*+/=?`{|}~^.-]+@[A-Z0-9.-]+$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A[\w!#$%&'*+/=?`{|}~^.-]+@[A-Z0-9.-]+\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
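One common way to follow the advice about sensitive characters is to hand the address to the database as a bound parameter, so that the driver takes care of quoting. The sketch below uses Python's standard `sqlite3` module; the `subscribers` table and the `store_email` function are hypothetical names invented for this illustration.

```python
import re
import sqlite3

# The "all characters" pattern from this recipe, case insensitive.
EMAIL = re.compile(r"\A[\w!#$%&'*+/=?`{|}~^.-]+@[A-Z0-9.-]+\Z", re.IGNORECASE)

def store_email(conn: sqlite3.Connection, address: str) -> bool:
    """Insert the address if it passes the regex; return whether it was stored."""
    if EMAIL.search(address) is None:
        return False
    # The ? placeholder passes the value separately from the SQL text,
    # so a single quote in the address cannot terminate the statement.
    conn.execute("INSERT INTO subscribers (email) VALUES (?)", (address,))
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (email TEXT)")  # hypothetical table

print(store_email(conn, "o'brien@example.com"))  # True
print(conn.execute("SELECT email FROM subscribers").fetchall())
```

The regex still accepts the single quote, as the text explains; the parameterized query is what keeps that character harmless once it reaches the SQL layer.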
Both the username and the domain name can contain one or more dots, but no two dots can appear right next to each other. Furthermore, the first and last characters in the username and in the domain name must not be dots:
^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@[A-Z0-9-]+(?:\.[A-Z0-9-]+)*$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@[A-Z0-9-]+(?:\.[A-Z0-9-]+)*\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
This regular expression adds to the previous versions by specifying that the domain name must include at least one dot, and that the part of the domain name after the dot can only consist of letters. That is, the domain must contain at least two levels, such as secondlevel.com or thirdlevel.secondlevel.com. The top-level domain, .com, must consist of two to six letters. All country-code top-level domains have two letters. The generic top-level domains have between three (.com) and six letters (.museum):
^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
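To see what this last, strictest pattern accepts and rejects, the following Python sketch runs it against a few sample addresses. The sample addresses and the name `is_valid` are illustrative only.

```python
import re

# Strictest pattern from this recipe: no leading, trailing, or consecutive
# dots, and a top-level domain of two to six letters.
EMAIL_TLD = re.compile(
    r"\A[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@"
    r"(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\Z",
    re.IGNORECASE,
)

def is_valid(address: str) -> bool:
    """Return True if the whole string matches the strict pattern."""
    return EMAIL_TLD.search(address) is not None

print(is_valid("john.doe@somewhere.com"))   # True
print(is_valid("curator@gallery.museum"))   # True: six-letter top-level domain
print(is_valid("john..doe@somewhere.com"))  # False: two consecutive dots
print(is_valid("john@localhost"))           # False: domain has no dot
print(is_valid("john@somewhere.c0m"))       # False: digit in top-level domain
```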
If you thought something as conceptually simple as validating an email address would have a simple one-size-fits-all regex solution, you’re quite wrong. This recipe is a prime example that before you can start writing a regular expression, you have to decide exactly what you want to match. There is no universally agreed-upon rule as to which email addresses are valid and which are not. It depends on your definition of valid.
asdf@asdf.asdf is valid according to RFC 2822, which defines the syntax for email addresses. But it is not valid if your definition specifies that a valid email address is one that accepts mail. There is no top-level asdf domain.
The short answer to the validity problem is that you can’t know whether john.doe@somewhere.com is an email address that can actually receive email until you try to send email to it. And even then, you can’t be sure if the lack of response signals that the somewhere.com domain is silently discarding mail sent to nonexistent mailboxes, or if John Doe hit the Delete button on his keyboard, or if his spam filter beat him to it.
Because you ultimately have to check whether the address exists by actually sending email to it, you can decide to use a simpler or more relaxed regular expression. Allowing invalid addresses to slip through may be preferable to annoying people by blocking valid addresses. For this reason, you may want to select the “simple, with all characters” regular expression. Though it obviously allows many things that aren’t email addresses, such as #$%@.-, the regex is quick and simple, and will never block a valid email address.
If you want to avoid sending too many undeliverable emails, while still not blocking any real email addresses, the regex in “Top-level domain has two to six letters” is a good choice.
You have to consider how complex you want your regular expression to be. If you’re validating user input, you’ll likely want a more complex regex, because the user could type in anything. But if you’re scanning database files that you know contain only valid email addresses, you can use a very simple regex that merely separates the email addresses from the other data. Even the solution in the earlier subsection may be enough in this case.
Finally, you have to consider how future-proof you want your regular expression to be. In the past, it made sense to restrict the top-level domain to only two-letter combinations for the country codes, and exhaustively list the generic top-level domains, i.e., ‹com|net|org|mil|edu›. With new top-level domains being added all the time, such regular expressions now quickly go out of date.
The regular expressions presented in this recipe show all the basic parts of the regular expression syntax in action. If you read up on these parts in Chapter 2, you can already do 90% of the jobs that are best solved with regular expressions.
All the regular expressions require the case-insensitive matching option to be turned on. Otherwise, only uppercase characters will be allowed. Turning on this option allows you to type ‹[A-Z]› instead of ‹[A-Za-z]›, saving a few keystrokes. If you use one of the last two regular expressions, the case-insensitivity option is very handy. Otherwise, you’d have to replace every letter ‹X› with ‹[Xx]›.
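The effect of the case-insensitive option is easy to check. This Python sketch applies the “restricted characters” pattern with and without the `re.IGNORECASE` flag; the sample address is made up for the illustration.

```python
import re

pattern = r"\A[A-Z0-9+_.-]+@[A-Z0-9.-]+\Z"
address = "John.Doe@Somewhere.com"

# Without the option, [A-Z] only allows uppercase letters.
print(re.search(pattern, address) is not None)                 # False
# With the option turned on, lowercase letters match too.
print(re.search(pattern, address, re.IGNORECASE) is not None)  # True
# An all-uppercase address matches even without the option.
print(re.search(pattern, address.upper()) is not None)         # True
```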
‹\S› and ‹\w› are shorthand character classes, as Recipe 2.3 explains. ‹\S› matches any nonwhitespace character, whereas ‹\w› matches a word character.
‹@› and ‹\.› match a literal @ sign and a dot, respectively. Since the dot is a metacharacter when used outside character classes, it needs to be escaped with a backslash. The @ sign never has a special meaning with any of the regular expression flavors in this book. Recipe 2.1 gives you a list of all the metacharacters that need to be escaped.
‹[A-Z0-9.-]› and the other sequences between square brackets are character classes. This one allows all letters between A and Z, all digits between 0 and 9, as well as a literal dot and hyphen. Though the hyphen normally creates a range in a character class, the hyphen is treated as a literal when it occurs as the last character in a character class. Recipe 2.3 tells you all about character classes, including combining them with shorthands, as in ‹[\w!#$%&'*+/=?`{|}~^.-]›. This class matches a word character, as well as any of the 19 listed punctuation characters.
‹+› and ‹*›, when used outside character classes, are quantifiers. The plus sign repeats the preceding regex token one or more times, whereas the asterisk repeats it zero or more times. In these regular expressions, the quantified token is usually a character class, and sometimes a group. Therefore, ‹[A-Z0-9.-]+› matches one or more letters, digits, dots, and/or hyphens.
As an example of the use of a group, ‹(?:[A-Z0-9-]+\.)+› matches one or more letters, digits, and/or hyphens, followed by one literal dot. The plus sign repeats this group one or more times. The group must match at least once, but can match as many times as possible. Recipe 2.12 explains the mechanics of constructs such as these in detail.
‹(?:group)› is a noncapturing group. Use it to create a group from part of the regular expression so you can apply a quantifier to the group as a whole. The capturing group ‹(group)› does the same thing with a cleaner syntax, so you could replace ‹(?:› with ‹(› in all of the regular expressions we’ve used so far without changing the overall match results.
But since we’re not interested in separately capturing parts of the email address, the noncapturing group is somewhat more efficient, although it makes the regular expression somewhat harder to read. Recipe 2.9 tells you all about capturing and noncapturing groups.
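A short Python sketch can make the difference between the two kinds of groups concrete. It compares a noncapturing and a capturing version of the same domain group; the sample address is invented for this illustration.

```python
import re

noncapturing = re.compile(
    r"\A[A-Z0-9+_.-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\Z", re.IGNORECASE
)
capturing = re.compile(
    r"\A[A-Z0-9+_.-]+@([A-Z0-9-]+\.)+[A-Z]{2,6}\Z", re.IGNORECASE
)

m1 = noncapturing.search("joe@mail.server.com")
m2 = capturing.search("joe@mail.server.com")

# The overall match is the same either way.
print(m1.group(0) == m2.group(0))  # True
# Only the capturing version stores part of the match separately.
# A repeated capturing group keeps the text of its last iteration.
print(m1.groups())  # ()
print(m2.groups())  # ('server.',)
```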
The anchors ‹^› and ‹$› force the regular expression to find its match at the start and end of the subject text, respectively. Placing the whole regular expression between these characters effectively requires the regular expression to match the entire subject.
This is important when validating user input. You do not want to accept drop database; -- joe@server.com haha! as a valid email address. Without the anchors, all the previous regular expressions will match because they find joe@server.com in the middle of the given text. See Recipe 2.5 for details. That recipe also explains why the “caret and dollar match at line breaks” matching option must be off.
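The following Python sketch shows the effect of leaving the anchors out, using the sample input from the paragraph above and the “restricted characters” pattern.

```python
import re

subject = "drop database; -- joe@server.com haha!"

unanchored = re.compile(r"[A-Z0-9+_.-]+@[A-Z0-9.-]+", re.IGNORECASE)
anchored = re.compile(r"^[A-Z0-9+_.-]+@[A-Z0-9.-]+$", re.IGNORECASE)

# Without anchors, the regex finds an address in the middle of the text.
print(unanchored.search(subject).group(0))  # joe@server.com
# With anchors, the whole subject has to be an email address.
print(anchored.search(subject))             # None
```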
In Ruby, the caret and dollar always match at line breaks. The regular expressions using the caret and dollar work correctly in Ruby, but only if the string you’re trying to validate contains no line breaks. If the string may contain line breaks, all the regexes using ‹^› and ‹$› will match the email address in drop database; --LFjoe@server.comLFhaha!LF, where LF represents a line break.
To avoid this, use the anchors ‹\A› and ‹\Z› instead. These match at the start and end of the string only, regardless of any options, in all flavors discussed in this book, except JavaScript. (In most of these flavors, ‹\Z› also matches just before a single line break at the very end of the string.) JavaScript does not support ‹\A› and ‹\Z› at all. Recipe 2.5 explains these anchors.
Tip
The issue with ‹^› and ‹$› versus ‹\A› and ‹\Z› applies to all regular expressions that validate input. There are a lot of these in this book. Although we will offer the occasional reminder, we will not constantly repeat this advice or show separate solutions for JavaScript and Ruby for each and every recipe. In many cases, we’ll show only one solution using the caret and dollar, and list Ruby as a compatible flavor. If you’re using Ruby, remember to use ‹\A› and ‹\Z› if you want to avoid matching one line in a multiline string.
This recipe illustrates how you can build a regular expression step-by-step. This technique is particularly handy with an interactive regular expression tester, such as RegexBuddy.
First, load a bunch of valid and invalid sample data into the tool. In this case, that would be a list of valid email addresses and a list of invalid email addresses.
Then, write a simple regular expression that matches all the valid email addresses. Ignore the invalid addresses for now. ‹^\S+@\S+$› already defines the basic structure of an email address: a username, an at sign, and a domain name.
With the basic structure of your text pattern defined, you can refine each part until your regular expression no longer matches any of the invalid data. If your regular expression only has to work with previously existing data, that can be a quick job. If your regex has to work with any user input, editing the regular expression until it is restrictive enough will be a much harder job than just getting it to match the valid data.
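If you don't have an interactive tester at hand, the same step-by-step workflow can be approximated with a small script. The sketch below is only one way to do it: the sample lists, the `CANDIDATES` dictionary, and the `report` function are all invented for this illustration.

```python
import re

# Sample data: addresses that should match and strings that should not.
valid = ["john.doe@somewhere.com", "jane+news@mail.example.org"]
invalid = ["john.doe", "john doe@somewhere.com",
           "john@somewhere", ".john@somewhere.com"]

# Successive refinements of the pattern, from this recipe.
CANDIDATES = {
    "simple": r"\A\S+@\S+\Z",
    "restricted": r"\A[A-Z0-9+_.-]+@[A-Z0-9.-]+\Z",
    "strict": (r"\A[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@"
               r"(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\Z"),
}

def report(pattern: str):
    """Return (valid samples missed, invalid samples wrongly matched)."""
    regex = re.compile(pattern, re.IGNORECASE)
    missed = [s for s in valid if not regex.search(s)]
    wrongly_matched = [s for s in invalid if regex.search(s)]
    return missed, wrongly_matched

for name, pattern in CANDIDATES.items():
    missed, wrongly_matched = report(pattern)
    print(f"{name}: missed={missed} wrongly_matched={wrongly_matched}")
```

With these samples, every candidate matches all the valid addresses, and only the strictest one rejects all the invalid ones.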
If you want to search for email addresses in larger bodies of text instead of checking whether the input as a whole is an email address, you cannot use the anchors ‹^› and ‹$›. Merely removing the anchors from the regular expression is not the right solution. If you do that with the final regex, which restricts the top-level domain to letters, it will match asdf@asdf.as in asdf@asdf.as99, for example. Instead of anchoring the regex match to the start and end of the subject, you have to specify that the start of the username and the top-level domain cannot be part of longer words.
This is easily done with a pair of word boundaries. Replace both ‹^› and ‹$› with ‹\b›. For instance, ‹^[A-Z0-9+_.-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$› becomes ‹\b[A-Z0-9+_.-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b›.
This regex combines the username portion from “Simple, with restrictions on characters” and the domain name portion from “Top-level domain has two to six letters.” We find that this regular expression works quite well in practice.
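Here is a minimal Python sketch of searching a larger text with the word-boundary version, next to the version that merely drops the anchors. The sample sentence is made up for the illustration.

```python
import re

with_boundaries = re.compile(
    r"\b[A-Z0-9+_.-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b", re.IGNORECASE
)
no_anchors = re.compile(
    r"[A-Z0-9+_.-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}", re.IGNORECASE
)

text = "Write to joe@server.com or jane@mail.example.org, not asdf@asdf.as99 though."

# The word boundaries keep the regex from matching part of asdf@asdf.as99.
print(with_boundaries.findall(text))
# ['joe@server.com', 'jane@mail.example.org']

# Without them, the regex stops in the middle of the top-level domain.
print(no_anchors.search("asdf@asdf.as99").group(0))  # asdf@asdf.as
```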
RFC 2822 defines the structure and syntax of email messages, including the email addresses used in email messages. You can download RFC 2822 at http://www.ietf.org/rfc/rfc2822.txt.