4.9. Limit the Length of Text

Problem

You want to test whether a string is composed of between 1 and 10 letters from A to Z.

Solution

All the programming languages covered by this book provide a simple, efficient way to check the length of text. For example, JavaScript strings have a length property that holds an integer indicating the string’s length. However, using regular expressions to check text length can be useful in some situations, particularly when length is only one of multiple rules that determine whether the subject text fits the desired pattern. The following regular expression ensures that text is between 1 and 10 characters long, and additionally limits the text to the uppercase letters A–Z. You can modify the regular expression to allow any minimum or maximum text length, or allow characters other than A–Z.

Regular expression

^[A-Z]{1,10}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Perl example

if ($ARGV[0] =~ /^[A-Z]{1,10}$/) {
    print "Input is valid\n";
} else {
    print "Input is invalid\n";
}

See Recipe 3.6 for help with implementing this regular expression with other programming languages.
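
Python example

As a rough counterpart to the Perl example, the following sketch applies the same regex in Python. The function name is_valid and the sample strings are illustrative assumptions, not part of the recipe.

```python
import re

# Same pattern as the Perl example: 1 to 10 uppercase letters A-Z.
# Note: in Python, $ also matches just before a single trailing newline;
# use \Z instead of $ if that distinction matters for your input.
LENGTH_LIMITED = re.compile(r"^[A-Z]{1,10}$")


def is_valid(text: str) -> bool:
    """Return True if text consists of 1 to 10 letters from A to Z."""
    return LENGTH_LIMITED.search(text) is not None


if __name__ == "__main__":
    for sample in ["REGEX", "", "ABCDEFGHIJK", "Regex"]:
        print(repr(sample), "is valid" if is_valid(sample) else "is invalid")
```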

Discussion

Here’s the breakdown for this very straightforward regex:

^         # Assert position at the beginning of the string.
[A-Z]     # Match one letter from A to Z
  {1,10}  #   between 1 and 10 times.
$         # Assert position at the end of the string.
Regex options: Free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

The ^ and $ anchors ensure that the regex matches the entire subject string; without them, it could match a run of up to 10 letters inside a longer string. The [A-Z] character class matches any single uppercase letter from A to Z, and the interval quantifier {1,10} repeats the character class from 1 to 10 times. By combining the interval quantifier with the surrounding start- and end-of-string anchors, the regex will fail to match if the subject text’s length falls outside the desired range.
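
The effect of the anchors can be seen in a small Python sketch. The 16-letter subject string is an arbitrary example chosen to exceed the limit.

```python
import re

subject = "ABCDEFGHIJKLMNOP"  # 16 letters, longer than the limit

# Without anchors, the regex is satisfied by the first 10 letters.
unanchored = re.search(r"[A-Z]{1,10}", subject)

# With anchors, the whole string must fit, so there is no match.
anchored = re.search(r"^[A-Z]{1,10}$", subject)

print(unanchored.group())
print(anchored)
```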

Note that the character class [A-Z] explicitly allows only uppercase letters. If you want to also allow the lowercase letters a to z, you can either change the character class to [A-Za-z] or apply the case insensitive option. Recipe 3.4 shows how to do this.
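
Both approaches might look like this in Python. One Python-specific caveat: with Unicode (str) patterns, combining [A-Z] with IGNORECASE also matches a few non-ASCII letters, so this sketch adds the ASCII flag to keep the two versions equivalent. The variable names are illustrative.

```python
import re

# Option 1: list both ranges explicitly.
explicit = re.compile(r"^[A-Za-z]{1,10}$")

# Option 2: keep [A-Z] and turn on case insensitivity.
# re.ASCII restricts case folding to the 52 ASCII letters.
flagged = re.compile(r"^[A-Z]{1,10}$", re.IGNORECASE | re.ASCII)

for sample in ["Regex", "regex", "Regex1", ""]:
    print(repr(sample), bool(explicit.search(sample)), bool(flagged.search(sample)))
```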

Tip

A mistake commonly made by new regular expression users is to try to save a few characters by using the character class range [A-z]. At first glance, this might seem like a clever trick to allow all uppercase and lowercase letters. However, the ASCII character table includes six punctuation characters in the positions between the A–Z and a–z ranges: [, \, ], ^, _, and `. Hence, [A-z] is actually equivalent to [A-Z[\\\]^_`a-z].
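
A short Python check makes the pitfall concrete. It lists the characters that sit between Z and a in ASCII and shows a string that the sloppy class accepts but the intended class rejects. The sample string foo_bar is an arbitrary illustration.

```python
import re

sloppy = re.compile(r"^[A-z]{1,10}$")
strict = re.compile(r"^[A-Za-z]{1,10}$")

# Characters whose code points fall between "Z" and "a".
extras = [chr(code) for code in range(ord("Z") + 1, ord("a"))]
print(extras)

# The underscore slips through the sloppy range.
print(bool(sloppy.search("foo_bar")))
print(bool(strict.search("foo_bar")))
```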

Variations

Limit the length of an arbitrary pattern

Because quantifiers such as {1,10} apply only to the immediately preceding element, limiting the number of characters that can be matched by patterns that include more than a single token requires a different approach.

As explained in Recipe 2.16, lookaheads (and their counterpart, lookbehinds) are a special kind of assertion that, like ^ and $, match a position within the subject string and do not consume any characters. Lookaheads can be either positive or negative, which means they can check if a pattern follows or does not follow the current position in the match. A positive lookahead, written as (?=…), can be used at the beginning of the pattern to ensure that the string is within the target length range. The remainder of the regex can then validate the desired pattern without worrying about text length. Here’s a simple example:

^(?=.{1,10}$).*
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

^(?=[\S\s]{1,10}$)[\S\s]*
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

It is important that the $ anchor appears inside the lookahead because the maximum length test works only if we ensure that there are no more characters after we’ve reached the limit. Because the lookahead at the beginning of the regex enforces the length range, the following pattern can then apply any additional validation rules. In this case, the pattern .* (or [\S\s]* in the version that adds native JavaScript support) is used to simply match the entire subject text with no added constraints.

The first regex uses the “dot matches line breaks” option so that it will work correctly when your subject string contains line breaks. See Recipe 3.4 for details about how to apply this modifier with your programming language. Standard JavaScript without XRegExp doesn’t have a “dot matches line breaks” option, so the second regex uses a character class that matches any character. See “Any character including line breaks” for more information.
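
To illustrate how the lookahead separates the length check from the rest of the pattern, here is a Python sketch. ANY_TEXT is the regex shown above, with re.DOTALL as Python's "dot matches line breaks" option. SLUG is a hypothetical pattern (lowercase words joined by single hyphens) invented for this example to stand in for "any additional validation rules."

```python
import re

# The recipe's regex: any text of 1 to 10 characters, line breaks included.
ANY_TEXT = re.compile(r"^(?=.{1,10}$).*", re.DOTALL)

# Hypothetical follow-on pattern: the lookahead enforces the length,
# and the remainder validates the shape without counting characters.
SLUG = re.compile(r"^(?=.{1,10}$)[a-z]+(?:-[a-z]+)*$", re.DOTALL)

print(bool(ANY_TEXT.search("two\nlines")))   # 9 characters
print(bool(SLUG.search("ab-cd")))            # valid shape, short enough
print(bool(SLUG.search("abc-def-ghi")))      # valid shape, 11 characters
print(bool(SLUG.search("ab--cd")))           # short enough, invalid shape
```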

Limit the number of nonwhitespace characters

The following regex matches any string that contains between 10 and 100 nonwhitespace characters:

^\s*(?:\S\s*){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

By default, \s in .NET, JavaScript, Perl, and Python 3.x matches all Unicode whitespace, and \S matches everything else. In Java, PCRE, Python 2.x, and Ruby, \s matches ASCII whitespace only, and \S matches everything else. In Python 2.x, you can make \s match all Unicode whitespace by passing the UNICODE or U flag when creating the regex. In Java 7, you can make \s match all Unicode whitespace by passing the UNICODE_CHARACTER_CLASS flag. Developers using Java 4 to 6, PCRE, and Ruby 1.9 who want to avoid having any Unicode whitespace count against their character limit can switch to the following version of the regex that takes advantage of Unicode categories (described in Recipe 2.7):

^[\p{Z}\s]*(?:[^\p{Z}\s][\p{Z}\s]*){10,100}$
Regex options: None
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Ruby 1.9

PCRE must be compiled with UTF-8 support for this to work. In PHP, turn on UTF-8 support with the /u pattern modifier.

This latter regex combines the Unicode \p{Z} Separator property with the \s shorthand for whitespace. That’s because the characters matched by \p{Z} and \s do not completely overlap. \s includes the characters at positions 0x09 through 0x0D (tab, line feed, vertical tab, form feed, and carriage return), which are not assigned the Separator property by the Unicode standard. By combining \p{Z} and \s in a character class, you ensure that all whitespace characters are matched.
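
Python's built-in re module does not support \p{Z}, so the regex above cannot be run there directly. The gap between the Separator property and \s can still be checked with the standard unicodedata module, as in this sketch. Here, category codes beginning with Z (Zs, Zl, Zp) correspond to \p{Z}, and Cc is the control-character category.

```python
import re
import unicodedata

# The five characters at 0x09 through 0x0D are control characters (Cc),
# not separators, even though \s matches them.
control_whitespace = [chr(code) for code in range(0x09, 0x0E)]
control_categories = [unicodedata.category(ch) for ch in control_whitespace]
print(control_categories)

# Space, no-break space, line separator, and paragraph separator
# all carry a Separator category.
separators = [" ", "\u00a0", "\u2028", "\u2029"]
separator_categories = [unicodedata.category(ch) for ch in separators]
print(separator_categories)

# \s matches the tab even though its category is not a Separator.
print(bool(re.fullmatch(r"\s", "\t")))
```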

In both regexes, the interval quantifier {10,100} is applied to the noncapturing group that precedes it, rather than a single token. The group matches any single nonwhitespace character followed by zero or more whitespace characters. The interval quantifier can reliably track how many nonwhitespace characters are matched because exactly one nonwhitespace character is matched during each iteration.
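
The counting behavior can be tried out with a small Python helper. The function name nonwhitespace_limit and its parameters are assumptions made for this sketch; it simply fills the two bounds into the first regex shown above.

```python
import re


def nonwhitespace_limit(minimum: int, maximum: int) -> re.Pattern:
    """Build the recipe's regex for a given range of nonwhitespace characters."""
    return re.compile(r"^\s*(?:\S\s*){%d,%d}$" % (minimum, maximum))


limit = nonwhitespace_limit(10, 100)

# Whitespace is skipped, so only the 10 letters count here.
print(bool(limit.search("  hello world  ")))

# Only 9 nonwhitespace characters: too few.
print(bool(limit.search("hello worl")))

# 101 nonwhitespace characters: too many.
print(bool(limit.search("x" * 101)))
```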

Limit the number of words

The following regex is very similar to the previous example of limiting the number of nonwhitespace characters, except that each repetition matches an entire word rather than a single nonwhitespace character. It matches between 10 and 100 words, skipping past any nonword characters, including punctuation and whitespace:

^\W*(?:\w+\b\W*){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In Java 4 to 6, JavaScript, PCRE, Python 2.x, and Ruby, the word character token \w in this regex will match only the ASCII characters A–Z, a–z, 0–9, and _, and therefore this cannot correctly count words that contain non-ASCII letters and numbers. In .NET and Perl, \w is based on the Unicode table (as is its inverse, \W, and the word boundary \b) and will match letters and digits from all Unicode scripts. In Python 2.x, you can choose to make these tokens Unicode-based by passing the UNICODE or U flag when creating the regex. In Python 3.x, they are Unicode-based by default. In Java 7, you can choose to make the shorthands for word and nonword characters Unicode-based by passing the UNICODE_CHARACTER_CLASS flag. Java’s \b is always Unicode-based.
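
The difference between ASCII-only and Unicode-based word tokens can be observed in Python 3, where \w is Unicode-based by default and the re.ASCII flag switches it back to ASCII-only. This sketch uses smaller bounds than the regex above so a two-word sample suffices; the sample text is an arbitrary illustration.

```python
import re

# The recipe's word-counting regex with the bounds left as placeholders.
WORDS = r"^\W*(?:\w+\b\W*){%d,%d}$"

text = "na\u00efve caf\u00e9"  # two words containing non-ASCII letters

# Unicode-based \w sees exactly two words.
unicode_two = re.search(WORDS % (2, 2), text)

# ASCII-only \w treats the accented letters as nonword characters,
# so it sees three fragments: "na", "ve", and "caf".
ascii_two = re.search(WORDS % (2, 2), text, re.ASCII)
ascii_three = re.search(WORDS % (3, 3), text, re.ASCII)

print(bool(unicode_two), bool(ascii_two), bool(ascii_three))
```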

If you want to count words that contain non-ASCII letters and numbers, the following regexes provide this capability for additional regex flavors:

^[^\p{L}\p{M}\p{Nd}\p{Pc}]*(?:[\p{L}\p{M}\p{Nd}\p{Pc}]+↵
\b[^\p{L}\p{M}\p{Nd}\p{Pc}]*){10,100}$
Regex options: None
Regex flavors: .NET, Java, Perl

^[^\p{L}\p{M}\p{Nd}\p{Pc}]*(?:[\p{L}\p{M}\p{Nd}\p{Pc}]+↵
(?:[^\p{L}\p{M}\p{Nd}\p{Pc}]+|$)){10,100}$
Regex options: None
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Ruby 1.9

PCRE must be compiled with UTF-8 support for this to work. In PHP, turn on UTF-8 support with the /u pattern modifier.

As noted, the reason for these different (but equivalent) regexes is the varying handling of the word character and word boundary tokens, explained more fully in “Word Characters.”

The last two regexes use character classes that include the separate Unicode categories for letters, marks (necessary for matching words of many languages), decimal numbers, and connector punctuation (the underscore and similar characters), which makes them equivalent to the earlier regex that used \w and \W.

Each repetition of the noncapturing group in the first two of these three regexes matches an entire word followed by zero or more nonword characters. The \W (or [^\p{L}\p{M}\p{Nd}\p{Pc}]) token inside the group is allowed to repeat zero times in case the string ends with a word character. However, since this effectively makes the nonword character sequence optional throughout the matching process, the word boundary assertion \b is needed between \w and \W (or [\p{L}\p{M}\p{Nd}\p{Pc}] and [^\p{L}\p{M}\p{Nd}\p{Pc}]), to ensure that each repetition of the group really matches an entire word. Without the word boundary, a single repetition would be allowed to match any part of a word, with subsequent repetitions matching additional pieces.
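
The role of the word boundary is easy to confirm with a Python sketch. Both patterns below ask for two or three words (smaller bounds than in the recipe, chosen for brevity), and the subject "regex" contains only one.

```python
import re

with_boundary = re.compile(r"^\W*(?:\w+\b\W*){2,3}$")
without_boundary = re.compile(r"^\W*(?:\w+\W*){2,3}$")

# One word only: the version with \b correctly refuses to match.
print(with_boundary.search("regex"))

# Without \b, repetitions can split "regex" into pieces and miscount.
print(bool(without_boundary.search("regex")))

# Two real words satisfy the version with \b.
print(bool(with_boundary.search("two words")))
```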

The third version of the regex (which adds support for XRegExp, PCRE, and Ruby 1.9) works a bit differently. It uses a plus (one or more) instead of an asterisk (zero or more) quantifier, and explicitly allows matching zero characters only if the matching process has reached the end of the string. This allows us to avoid the word boundary token, which is necessary to ensure accuracy since \b is not Unicode-enabled in XRegExp, PCRE, or Ruby. \b is Unicode-enabled in Java, even though Java’s \w is not (unless you use the UNICODE_CHARACTER_CLASS flag in Java 7).
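
The same idea can be sketched in Python using \w and \W in place of the Unicode category classes, since Python's re module lacks \p{…} support. That substitution is an assumption of this example, as are the smaller bounds. Each repetition must end with at least one nonword character or at the end of the string, so no \b is needed.

```python
import re

# One or more nonword characters, or the end of the string, must
# follow every word; this replaces the \b\W* used in the other version.
no_boundary_token = re.compile(r"^\W*(?:\w+(?:\W+|$)){2,3}$")

print(no_boundary_token.search("regex"))                   # one word
print(bool(no_boundary_token.search("two words")))         # two words
print(bool(no_boundary_token.search("one, two, three.")))  # three words
print(no_boundary_token.search("a b c d"))                 # four words
```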

Unfortunately, none of these options allow standard JavaScript or Ruby 1.8 to correctly handle words that use non-ASCII characters. A possible workaround is to reframe the regex to count runs of nonwhitespace characters separated by whitespace, rather than word character sequences, as shown here:

^\s*(?:\S+(?:\s+|$)){10,100}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In many cases, this will work the same as the previous solutions, although it’s not exactly equivalent. For example, one difference is that compounds joined by a hyphen, such as “far-reaching,” will now be counted as one word instead of two. The same applies to words with apostrophes, such as “don’t.”
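
The following Python sketch compares the two ways of counting on the examples just mentioned. The helper name matches_exact_count is hypothetical; it fills the same number into both bounds so that a match reveals how many words each regex sees.

```python
import re

BY_WHITESPACE = r"^\s*(?:\S+(?:\s+|$)){%d,%d}$"
BY_WORD_CHARS = r"^\W*(?:\w+\b\W*){%d,%d}$"


def matches_exact_count(template: str, text: str, count: int) -> bool:
    """Return True if the regex sees exactly `count` words in text."""
    return re.search(template % (count, count), text) is not None


# Whitespace-delimited counting sees two words in each phrase.
print(matches_exact_count(BY_WHITESPACE, "far-reaching plan", 2))
print(matches_exact_count(BY_WHITESPACE, "don't stop", 2))

# Counting word character sequences sees three in each phrase.
print(matches_exact_count(BY_WORD_CHARS, "far-reaching plan", 3))
print(matches_exact_count(BY_WORD_CHARS, "don't stop", 3))
```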

See Also

Recipe 4.8 shows how to limit input by character set (alphanumeric, ASCII-only, etc.) instead of length.

Recipe 4.10 explains the subtleties that go into precisely limiting the number of lines in your text.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.7 explains how to match Unicode characters. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
