All the programming languages covered by this book provide a
simple, efficient way to check the length of text. For example,
JavaScript strings have a length
property that
holds an integer indicating the string’s length. However, using regular
expressions to check text length can be useful in some situations,
particularly when length is only one of multiple rules that determine
whether the subject text fits the desired pattern. The following regular
expression ensures that text is between 1 and 10 characters long, and
additionally limits the text to the uppercase letters A–Z. You can
modify the regular expression to allow any minimum or maximum text
length, or allow characters other than A–Z.
if ($ARGV[0] =~ /^[A-Z]{1,10}$/) {
    print "Input is valid\n";
} else {
    print "Input is invalid\n";
}
See Recipe 3.6 for help with implementing this regular expression with other programming languages.
Here’s the breakdown for this very straightforward regex:
^        # Assert position at the beginning of the string.
[A-Z]    # Match one letter from A to Z
  {1,10} #   between 1 and 10 times.
$        # Assert position at the end of the string.
Regex options: Free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
The ‹^› and ‹$› anchors ensure that the regex matches the entire subject string; otherwise, it could match 10 characters within longer text. The ‹[A-Z]› character class matches any single uppercase character from A to Z, and the interval quantifier ‹{1,10}› repeats the character class from 1 to 10 times. By combining the interval quantifier with the surrounding start- and end-of-string anchors, the regex will fail to match if the subject text’s length falls outside the desired range.
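As a rough illustration of why the anchors matter, the following Python sketch compares the anchored regex with an unanchored copy. The helper name is_valid and the sample strings are invented for this example and are not part of the recipe.

```python
import re

# The recipe's regex: 1 to 10 uppercase letters, anchored at both ends.
ANCHORED = re.compile(r"^[A-Z]{1,10}$")
# The same regex without anchors, for comparison only.
UNANCHORED = re.compile(r"[A-Z]{1,10}")


def is_valid(text):
    """Return True if text consists of 1 to 10 uppercase letters A-Z."""
    return ANCHORED.search(text) is not None


print(is_valid("REGEX"))        # True: 5 letters
print(is_valid(""))             # False: too short
print(is_valid("ABCDEFGHIJK"))  # False: 11 letters
# Without anchors, the regex simply matches the first 10 letters.
print(UNANCHORED.search("ABCDEFGHIJK").group())  # ABCDEFGHIJ
```

One caveat for Python: ‹$› also matches just before a line break at the very end of the string, so a value such as "ABC\n" would pass this check. Using re.fullmatch instead of the anchors is one way around that if it matters for your input.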
Note that the character class ‹[A-Z]› explicitly allows only uppercase letters. If you want to also allow the lowercase letters a to z, you can either change the character class to ‹[A-Za-z]› or apply the case insensitive option. Recipe 3.4 shows how to do this.
Tip
A mistake commonly made by new regular expression users is to try to save a few characters by using the character class range ‹[A-z]›. At first glance, this might seem like a clever trick to allow all uppercase and lowercase letters. However, the ASCII character table includes several punctuation characters in positions between the A–Z and a–z ranges. Hence, ‹[A-z]› is actually equivalent to ‹[A-Z[\]^_`a-z]›.
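The effect described in this tip can be checked with a short Python sketch. The variable names and sample strings below are only illustrative.

```python
import re

# Intended: letters only. Actual: also [ \ ] ^ _ and the backtick.
sloppy = re.compile(r"^[A-z]+$")
strict = re.compile(r"^[A-Za-z]+$")

for sample in ["Regex", "snake_case", "x^y", "[]"]:
    print(sample, bool(sloppy.search(sample)), bool(strict.search(sample)))

# The six characters that sit between "Z" (0x5A) and "a" (0x61) in ASCII.
between = "".join(chr(code) for code in range(ord("Z") + 1, ord("a")))
print(between)
```

Only "Regex" passes the strict class, while all four samples pass the sloppy one.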
Because quantifiers such as ‹{1,10}› apply only to the immediately preceding element, limiting the number of characters that can be matched by patterns that include more than a single token requires a different approach.
As explained in Recipe 2.16, lookaheads (and their counterpart, lookbehinds) are a special kind of assertion that, like ‹^› and ‹$›, match a position within the subject string and do not consume any characters. Lookaheads can be either positive or negative, which means they can check if a pattern follows or does not follow the current position in the match. A positive lookahead, written as ‹(?=⋯)›, can be used at the beginning of the pattern to ensure that the string is within the target length range. The remainder of the regex can then validate the desired pattern without worrying about text length. Here’s a simple example:
^(?=.{1,10}$).*
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
^(?=[\S\s]{1,10}$)[\S\s]*
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
It is important that the ‹$› anchor appears inside the lookahead because the maximum length test works only if we ensure that there are no more characters after we’ve reached the limit. Because the lookahead at the beginning of the regex enforces the length range, the following pattern can then apply any additional validation rules. In this case, the pattern ‹.*› (or ‹[\S\s]*› in the version that adds native JavaScript support) is used to simply match the entire subject text with no added constraints.
The first regex uses the “dot matches line breaks” option so that it will work correctly when your subject string contains line breaks. See Recipe 3.4 for details about how to apply this option in your programming language. Standard JavaScript without XRegExp doesn’t have a “dot matches line breaks” option, so the second regex uses a character class that matches any character. See “Any character including line breaks” for more information.
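To make the lookahead technique more concrete, here is a minimal Python sketch. The first pattern is the second regex shown above. The slug pattern is a hypothetical extra rule (lowercase words joined by single hyphens) invented for this example, to show a multi-token pattern whose overall length a plain ‹{1,10}› quantifier could not limit.

```python
import re

# Length check from the recipe: 1 to 10 characters of any kind.
any_text = re.compile(r"^(?=[\S\s]{1,10}$)[\S\s]*")

# Hypothetical extra rule (not from the recipe): lowercase words joined
# by single hyphens, still limited to 10 characters by the lookahead.
slug = re.compile(r"^(?=[\S\s]{1,10}$)[a-z]+(?:-[a-z]+)*$")

print(bool(any_text.search("two\nlines")))    # True: 9 characters
print(bool(any_text.search("line1\nline2")))  # False: 11 characters
print(bool(slug.search("far-off")))           # True: right shape, 7 characters
print(bool(slug.search("far-reaching")))      # False: right shape, 12 characters
print(bool(slug.search("far--off")))          # False: short enough, wrong shape
```

The lookahead and the main pattern are checked independently from the same starting position, which is why the length rule and the shape rule can fail separately.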
The following regex matches any string that contains between 10 and 100 nonwhitespace characters:
^\s*(?:\S\s*){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
By default, ‹\s› in .NET, JavaScript, Perl, and Python 3.x matches all Unicode whitespace, and ‹\S› matches everything else. In Java, PCRE, Python 2.x, and Ruby, ‹\s› matches ASCII whitespace only, and ‹\S› matches everything else. In Python 2.x, you can make ‹\s› match all Unicode whitespace by passing the UNICODE or U flag when creating the regex. In Java 7, you can make ‹\s› match all Unicode whitespace by passing the UNICODE_CHARACTER_CLASS flag. Developers using Java 4 to 6, PCRE, and Ruby 1.9 who want to avoid having any Unicode whitespace count against their character limit can switch to the following version of the regex that takes advantage of Unicode categories (described in Recipe 2.7):
^[\p{Z}\s]*(?:[^\p{Z}\s][\p{Z}\s]*){10,100}$
Regex options: None
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Ruby 1.9
PCRE must be compiled with UTF-8 support for this to work. In PHP, turn on UTF-8 support with the /u pattern modifier.
This latter regex combines the Unicode ‹\p{Z}› Separator property with the ‹\s› shorthand for whitespace. That’s because the characters matched by ‹\p{Z}› and ‹\s› do not completely overlap. ‹\s› includes the characters at positions 0x09 through 0x0D (tab, line feed, vertical tab, form feed, and carriage return), which are not assigned the Separator property by the Unicode standard. By combining ‹\p{Z}› and ‹\s› in a character class, you ensure that all whitespace characters are matched.
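The mismatch between ‹\p{Z}› and ‹\s› can be inspected with a small Python sketch. Python’s standard re module is not among the flavors listed for the ‹\p{Z}› regex, so this example uses the unicodedata module as a stand-in: a general category starting with "Z" corresponds to the Separator property. The sample characters are chosen for illustration.

```python
import re
import unicodedata

samples = {
    "tab": "\t",
    "line feed": "\n",
    "vertical tab": "\x0b",
    "form feed": "\x0c",
    "carriage return": "\r",
    "space": " ",
    "no-break space": "\u00a0",
    "line separator": "\u2028",
}

for name, char in samples.items():
    category = unicodedata.category(char)    # e.g. "Cc" or "Zs"
    is_separator = category.startswith("Z")  # what \p{Z} would test
    matches_s = re.fullmatch(r"\s", char) is not None
    print(f"{name:16} {category}  separator={is_separator}  matches_s={matches_s}")
```

The five characters from 0x09 through 0x0D report category Cc (control), not a Separator category, yet all of them match ‹\s›.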
In both regexes, the interval quantifier ‹{10,100}› is applied to the noncapturing group that precedes it, rather than a single token. The group matches any single nonwhitespace character followed by zero or more whitespace characters. The interval quantifier can reliably track how many nonwhitespace characters are matched because exactly one nonwhitespace character is matched during each iteration.
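A quick Python sketch of the first of these regexes follows. The helper nonwhitespace_limit is a hypothetical convenience for plugging in other ranges and is not part of the recipe.

```python
import re


def nonwhitespace_limit(minimum, maximum):
    """Build the recipe's regex for a given range of nonwhitespace characters."""
    return re.compile(r"^\s*(?:\S\s*){%d,%d}$" % (minimum, maximum))


limit = nonwhitespace_limit(10, 100)

print(bool(limit.search("a b c d e")))                  # False: 5 characters
print(bool(limit.search("  regular   expressions  ")))  # True: 18 characters
print(bool(limit.search("x " * 100)))                   # True: exactly 100
print(bool(limit.search("x " * 101)))                   # False: 101 characters
```

Leading, trailing, and repeated whitespace has no effect on the count, because each iteration of the group consumes exactly one nonwhitespace character.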
The following regex is very similar to the previous example of limiting the number of nonwhitespace characters, except that each repetition matches an entire word rather than a single nonwhitespace character. It matches between 10 and 100 words, skipping past any nonword characters, including punctuation and whitespace:
^\W*(?:\w+\b\W*){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In Java 4 to 6, JavaScript, PCRE, Python 2.x, and Ruby, the word character token ‹\w› in this regex will match only the ASCII characters A–Z, a–z, 0–9, and _, and therefore this cannot correctly count words that contain non-ASCII letters and numbers. In .NET and Perl, ‹\w› is based on the Unicode table (as is its inverse, ‹\W›, and the word boundary ‹\b›) and will match letters and digits from all Unicode scripts. In Python 2.x, you can choose to make these tokens Unicode-based by passing the UNICODE or U flag when creating the regex. In Python 3.x, they are Unicode-based by default. In Java 7, you can choose to make the shorthands for word and nonword characters Unicode-based by passing the UNICODE_CHARACTER_CLASS flag. Java’s ‹\b› is always Unicode-based.
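The difference between ASCII-only and Unicode-based ‹\w› can be seen in Python 3.x, where the re.ASCII flag switches to the ASCII-only behavior. In this sketch the sample text is invented and the range is shrunk to exactly two words so that a short string suffices; the recipe itself uses ‹{10,100}›.

```python
import re

text = "na\u00efve caf\u00e9"  # "naïve café", written with escapes

# Python 3 str patterns: \w is Unicode-based by default.
print(re.findall(r"\w+", text))            # ['naïve', 'café']
# re.ASCII restricts \w to [A-Za-z0-9_], which splits the words.
print(re.findall(r"\w+", text, re.ASCII))  # ['na', 've', 'caf']

# Word-count regex with the range shrunk to exactly 2 words.
two_words = r"^\W*(?:\w+\b\W*){2}$"
print(bool(re.search(two_words, text)))            # True: 2 words
print(bool(re.search(two_words, text, re.ASCII)))  # False: 3 fragments
```

With ASCII-only word characters, the accented letters act as separators, so the same text is miscounted as three words.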
If you want to count words that contain non-ASCII letters and numbers, the following regexes provide this capability for additional regex flavors:
^[^\p{L}\p{M}\p{Nd}\p{Pc}]*(?:[\p{L}\p{M}\p{Nd}\p{Pc}]+\b[^\p{L}\p{M}\p{Nd}\p{Pc}]*){10,100}$
Regex options: None
Regex flavors: .NET, Java, Perl
^[^\p{L}\p{M}\p{Nd}\p{Pc}]*(?:[\p{L}\p{M}\p{Nd}\p{Pc}]+(?:[^\p{L}\p{M}\p{Nd}\p{Pc}]+|$)){10,100}$
Regex options: None
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Ruby 1.9
PCRE must be compiled with UTF-8 support for this to work. In PHP, turn on UTF-8 support with the /u pattern modifier.
As noted, the reason for these different (but equivalent) regexes is the varying handling of the word character and word boundary tokens, explained more fully in “Word Characters.”
The last two regexes use character classes that include the separate Unicode categories for letters, marks (necessary for matching words of many languages), decimal numbers, and connector punctuation (the underscore and similar characters), which makes them equivalent to the earlier regex that used ‹\w› and ‹\W›.
Each repetition of the noncapturing group in the first two of these three regexes matches an entire word followed by zero or more nonword characters. The ‹\W› (or ‹[^\p{L}\p{M}\p{Nd}\p{Pc}]›) token inside the group is allowed to repeat zero times in case the string ends with a word character. However, since this effectively makes the nonword character sequence optional throughout the matching process, the word boundary assertion ‹\b› is needed between ‹\w› and ‹\W› (or ‹[\p{L}\p{M}\p{Nd}\p{Pc}]› and ‹[^\p{L}\p{M}\p{Nd}\p{Pc}]›), to ensure that each repetition of the group really matches an entire word. Without the word boundary, a single repetition would be allowed to match any part of a word, with subsequent repetitions matching additional pieces.
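The role of the word boundary is easy to demonstrate in Python. In this sketch the range is shrunk to 2 to 3 words to keep the samples short, and the pattern without ‹\b› is a deliberately broken variant shown for comparison only.

```python
import re

# Range shrunk to 2-3 words for a short demonstration.
with_boundary = re.compile(r"^\W*(?:\w+\b\W*){2,3}$")
without_boundary = re.compile(r"^\W*(?:\w+\W*){2,3}$")

for sample in ["word", "two words", "one, two, three, four"]:
    print(repr(sample),
          bool(with_boundary.search(sample)),
          bool(without_boundary.search(sample)))
```

The single word "word" is rejected with the boundary in place, but accepted without it, because the regex engine can split it into pieces such as "wor" and "d" and count each piece as a repetition. The other two samples give the same result either way, since dropping the boundary lets the regex split words but not merge them.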
The third version of the regex (which adds support for XRegExp, PCRE, and Ruby 1.9) works a bit differently. It uses a plus (one or more) instead of an asterisk (zero or more) quantifier, and explicitly allows matching zero characters only if the matching process has reached the end of the string. This allows us to avoid the word boundary token, which is necessary to ensure accuracy since ‹\b› is not Unicode-enabled in XRegExp, PCRE, or Ruby. ‹\b› is Unicode-enabled in Java, even though Java’s ‹\w› is not (unless you use the UNICODE_CHARACTER_CLASS flag in Java 7).
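The structure of this third version can be sketched in Python, with two substitutions that are assumptions of this example: ‹\w› and ‹\W› stand in for the ‹\p{⋯}› character classes, since Python is not listed among the flavors for that regex, and the range is shrunk to 2 to 3 words.

```python
import re

# Stand-in for the third regex: \w and \W replace the \p{...} classes,
# and the range is shrunk to 2-3 words.
no_boundary_needed = re.compile(r"^\W*(?:\w+(?:\W+|$)){2,3}$")

print(bool(no_boundary_needed.search("word")))                   # False: 1 word
print(bool(no_boundary_needed.search("two words")))              # True: 2 words
print(bool(no_boundary_needed.search("  three short words. ")))  # True: 3 words
print(bool(no_boundary_needed.search("now we have four")))       # False: 4 words
```

Because every repetition must end with at least one nonword character or the end of the string, a repetition can never stop in the middle of a word, so no ‹\b› is required.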
Unfortunately, none of these options allow standard JavaScript or Ruby 1.8 to correctly handle words that use non-ASCII characters. A possible workaround is to reframe the regex to count whitespace rather than word character sequences, as shown here:
^\s*(?:\S+(?:\s+|$)){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, Perl, PCRE, Python, Ruby
In many cases, this will work the same as the previous solutions, although it’s not exactly equivalent. For example, one difference is that compounds joined by a hyphen, such as “far-reaching,” will now be counted as one word instead of two. The same applies to words with apostrophes, such as “don’t.”
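The following Python sketch compares the two approaches on the kinds of words mentioned above. The helper functions word_regex and whitespace_regex are hypothetical names used only here, and both are called with a range of exactly two words to expose the difference.

```python
import re


def word_regex(minimum, maximum):
    """Word-character version: counts runs of word characters."""
    return re.compile(r"^\W*(?:\w+\b\W*){%d,%d}$" % (minimum, maximum))


def whitespace_regex(minimum, maximum):
    """Whitespace version: counts runs of nonwhitespace characters."""
    return re.compile(r"^\s*(?:\S+(?:\s+|$)){%d,%d}$" % (minimum, maximum))


for sample in ["far-reaching", "don't", "plain words"]:
    by_word = bool(word_regex(2, 2).search(sample))
    by_space = bool(whitespace_regex(2, 2).search(sample))
    print(repr(sample), by_word, by_space)
```

The word-character version counts "far-reaching" and "don't" as two words each, whereas the whitespace version counts each as one, so only "plain words" is accepted by both.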
Recipe 4.8 shows how to limit input by character set (alphanumeric, ASCII-only, etc.) instead of length.
Recipe 4.10 explains the subtleties that go into precisely limiting the number of lines in your text.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.7 explains how to match Unicode characters. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.