You need to check whether a string is comprised of five or fewer lines, without regard for how many total characters appear in the string.
The exact characters or character sequences used as line separators can vary depending on your operating system’s convention, application or user preferences, and so on. Crafting an ideal solution therefore raises questions about what conventions should be supported to indicate the start of a new line. The following solutions support the standard MS-DOS/Windows (‹\r\n›), legacy Mac OS (‹\r›), and Unix/Linux/OS X (‹\n›) line break conventions.
The following three flavor-specific regexes differ in two ways. The first regex uses atomic groups, written as ‹(?>⋯)›, instead of noncapturing groups, written as ‹(?:⋯)›, because atomic groups have the potential to provide a minor efficiency improvement here for the regex flavors that support them. Python and JavaScript do not support atomic groups, so they are not used with those flavors. The other difference is the tokens used to assert position at the beginning and end of the string (‹\A› or ‹^› for the beginning of the string, and ‹\z›, ‹\Z›, or ‹$› for the end). The reasons for this variation are discussed in depth later in this recipe. All three flavor-specific regexes match exactly the same strings:
\A(?>(?>\r\n?|\n)?[^\r\n]*){0,5}\z
Regex options: None |
Regex flavors: .NET, Java, PCRE, Perl, Ruby |
\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z
Regex options: None |
Regex flavor: Python |
^(?:(?:\r\n?|\n)?[^\r\n]*){0,5}$
Regex options: None |
Regex flavor: JavaScript |
if (preg_match('/\A(?>(?>\r\n?|\n)?[^\r\n]*){0,5}\z/', $_POST['subject'])) {
    print 'Subject contains five or fewer lines';
} else {
    print 'Subject contains more than five lines';
}
See Recipe 3.5 for help implementing these regular expressions with other programming languages.
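As a rough counterpart to the PHP snippet, here is a minimal Python sketch that uses the Python flavor of the regex shown earlier. The helper name `has_five_or_fewer_lines` and the sample `subject` string are made up for illustration and do not come from the recipe.

```python
import re

# Python flavor of the regex from this recipe (noncapturing groups, \A and \Z).
FIVE_OR_FEWER_LINES = re.compile(r"\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z")


def has_five_or_fewer_lines(text: str) -> bool:
    """Hypothetical helper: True if the regex accepts the whole string."""
    return FIVE_OR_FEWER_LINES.match(text) is not None


# Sample input mixing Windows and Unix line breaks (illustrative only).
subject = "line 1\r\nline 2\nline 3"
if has_five_or_fewer_lines(subject):
    print("Subject contains five or fewer lines")
else:
    print("Subject contains more than five lines")
```

Compiling the pattern once and wrapping it in a small function keeps the regex in a single place if the check is needed in several parts of a program.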
All of the regular expressions shown so far in this recipe use a grouping that matches an MS-DOS/Windows, legacy Mac OS, or Unix/Linux/OS X line break sequence followed by any number of non-line-break characters. The grouping is repeated between zero and five times, since we’re matching up to five lines.
In the following example, we’ve broken up the JavaScript version of the regex into its individual parts. We’ve used the JavaScript version here because its elements are probably familiar to the widest range of readers. We’ll explain the variations for alternative regex flavors afterward:
^                # Assert position at the beginning of the string.
(?:              # Group but don't capture...
  (?:            # Group but don't capture...
    \r           # Match a carriage return (CR, ASCII position 0x0D).
    \n           # Match a line feed (LF, ASCII position 0x0A)...
      ?          # between zero and one time.
   |             # or...
    \n           # Match a line feed character.
  )              # End the noncapturing group.
    ?            # Repeat the preceding group between zero and one time.
  [^\r\n]        # Match any single character except CR or LF...
    *            # between zero and unlimited times.
)                # End the noncapturing group.
  {0,5}          # Repeat the preceding group between zero and five times.
$                # Assert position at the end of the string.
Regex options: Free-spacing |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
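For readers who want to try the free-spacing form, the following sketch writes the same structure using Python's `re.VERBOSE` option. It swaps in ‹\A› and ‹\Z›, the anchors this recipe uses for Python, and the sample strings are illustrative assumptions.

```python
import re

# Free-spacing version; whitespace and "#" comments are ignored by re.VERBOSE.
verbose = re.compile(r"""
    \A               # Assert position at the beginning of the string.
    (?:              # Group but don't capture...
      (?:            # Group but don't capture...
        \r \n?       # a CR optionally followed by an LF...
       |             # or...
        \n           # a standalone LF.
      )?             # The line break itself is optional.
      [^\r\n]*       # Any number of non-line-break characters.
    ){0,5}           # Repeat the group between zero and five times.
    \Z               # Assert position at the very end of the string.
""", re.VERBOSE)

# Compact equivalent, as listed for the Python flavor.
compact = re.compile(r"\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z")

samples = ["", "a", "a\r\nb\rc\nd\ne", "1\n2\n3\n4\n5\n6"]
for s in samples:
    print(repr(s), bool(verbose.match(s)), bool(compact.match(s)))
```

Both patterns should agree on every sample, since free-spacing mode changes only how the pattern is written, not what it matches.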
The leading ‹^› matches the position at the beginning of the string. This helps to ensure that the entire string contains no more than five lines, because unless the regex is forced to start at the beginning of the string, it can match any five lines within a longer string.
Next, a noncapturing group encloses the combination of a line break sequence and any number of non-line-break characters. The immediately following quantifier allows this group to repeat between zero and five times (zero repetitions would match a completely empty string). Within the outer group, an optional subgroup matches a line break sequence. Next is the character class that matches any number of non-line-break characters.
Take a close look at the order of the outer group’s elements (first a line break, then a non-line-break sequence). If we reversed the order so that the group was instead written as ‹(?:[^\r\n]*(?:\r\n?|\n)?)›, a fifth repetition would allow a trailing line break. Effectively, such a change would allow an empty, sixth line.
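The effect of the ordering can be checked with a small experiment. This sketch uses the Python flavor, and the variable names and test strings are assumptions chosen for illustration.

```python
import re

# Order used in this recipe: optional line break first, then line content.
break_first = re.compile(r"\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z")
# Reversed order: line content first, then an optional line break.
content_first = re.compile(r"\A(?:[^\r\n]*(?:\r\n?|\n)?){0,5}\Z")

five_lines = "1\n2\n3\n4\n5"
trailing_break = five_lines + "\n"  # five lines plus an empty sixth line

for name, regex in (("break first", break_first),
                    ("content first", content_first)):
    results = [bool(regex.match(s)) for s in (five_lines, trailing_break)]
    print(name, results)
```

Both patterns accept the plain five-line string. Only the reversed pattern also accepts the string with the trailing line break, which is the empty sixth line the recipe's ordering rejects.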
The subgroup allows any of three line break sequences:
- A carriage return followed by a line feed (‹\r\n›, the conventional MS-DOS/Windows line break sequence)
- A standalone carriage return (‹\r›, the legacy Mac OS line break character)
- A standalone line feed (‹\n›, the conventional Unix/Linux/OS X line break character)
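To see the subgroup on its own, the sketch below uses ‹\r\n?|\n› to split a string that mixes all three conventions. The sample text and the contrasting `naive` pattern are assumptions added for illustration.

```python
import re

# The line break subgroup from this recipe, used on its own.
LINE_BREAK = re.compile(r"\r\n?|\n")
# A naive alternative that treats CR and LF as separate breaks.
naive = re.compile(r"\r|\n")

mixed = "dos\r\nmac\runix\nlast"

print(LINE_BREAK.split(mixed))  # CRLF is consumed as one line break
print(naive.split(mixed))       # CRLF is counted as two line breaks
```

Because ‹\r\n?› consumes the line feed that follows a carriage return, a Windows line break is counted once rather than twice, which matters when the goal is to count lines.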
Now let’s move on to the cross-flavor differences.
The first version of the regex (used by all flavors except Python and JavaScript) uses atomic groups rather than simple noncapturing groups. Although in some cases the use of atomic groups can have a much more profound impact, in this case they simply let the regex engine avoid a bit of unnecessary backtracking that can occur if the match attempt fails (see Recipe 2.15 for more information about atomic groups).
The other cross-flavor differences are the tokens used to assert position at the beginning and end of the string. The breakdown shown earlier used ‹^› and ‹$› for these purposes. Although these anchors are supported by all of the regex flavors discussed here, the alternative regexes in this section used ‹\A›, ‹\Z›, and ‹\z› instead. The short explanation for this is that the meaning of these metacharacters differs slightly between regular expression flavors. The long explanation leads us to a bit of regex history....
When using Perl to read a line from a file, the resulting string ends with a line break. Hence, Perl introduced an “enhancement” to the traditional meaning of ‹$› that has since been copied by most regex flavors. In addition to matching the absolute end of a string, Perl’s ‹$› matches just before a string-terminating line break. Perl also introduced two more assertions that match the end of a string: ‹\Z› and ‹\z›. Perl’s ‹\Z› anchor has the same quirky meaning as ‹$›, except that it doesn’t change when the option to let ‹^› and ‹$› match at line breaks is enabled. ‹\z› always matches only the absolute end of a string, no exceptions. Since this recipe explicitly deals with line breaks in order to count the lines in a string, it uses the ‹\z› assertion for the regex flavors that support it, to ensure that an empty, sixth line is not allowed.
Most of the other regex flavors copied Perl’s end-of-line/string anchors. .NET, Java, PCRE, and Ruby all support both ‹\Z› and ‹\z› with the same meanings as Perl. Python includes only ‹\Z› (uppercase), but confusingly changes its meaning to match only the absolute end of the string, just like Perl’s lowercase ‹\z›. JavaScript doesn’t include any “z” anchors, but unlike all of the other flavors discussed here, its ‹$› anchor matches only at the absolute end of the string (when the option to let ‹^› and ‹$› match at line breaks is not enabled).
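The anchor behavior described for Python can be observed directly. This is a small sketch with made-up sample strings. It also shows why the Python version of this recipe's regex ends with ‹\Z› rather than ‹$›.

```python
import re

text = "line 1\nline 2\n"  # ends with a line break

# "$" also matches just before a string-terminating line break.
dollar_at_end = bool(re.search(r"line 2$", text))
# "\Z" in Python matches only at the absolute end of the string.
z_at_end = bool(re.search(r"line 2\Z", text))
# Without re.MULTILINE, "$" does not match at interior line breaks.
dollar_inside = bool(re.search(r"line 1$", text))
dollar_inside_multiline = bool(re.search(r"line 1$", text, re.MULTILINE))
print(dollar_at_end, z_at_end, dollar_inside, dollar_inside_multiline)

# Consequence for this recipe: with "$", Python would accept an empty sixth line.
with_dollar = re.compile(r"^(?:(?:\r\n?|\n)?[^\r\n]*){0,5}$")
with_z = re.compile(r"\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z")
six_with_empty_last = "1\n2\n3\n4\n5\n"
print(bool(with_dollar.match(six_with_empty_last)),
      bool(with_z.match(six_with_empty_last)))
```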
As for ‹\A›, the situation is slightly better. It always matches only at the start of a string, and it means exactly the same thing in all flavors discussed here, except JavaScript (which doesn’t support it).
Although it’s unfortunate that these kinds of confusing cross-flavor inconsistencies exist, one of the benefits of using the regular expressions in this book is that you generally won’t need to worry about them. Gory details like the ones we’ve just described are included in case you care to dig deeper.
The previously shown regexes limit support to the conventional MS-DOS/Windows, Unix/Linux/OS X, and legacy Mac OS line break character sequences. However, there are several rarer vertical whitespace characters that you might encounter occasionally. The following regexes take these additional characters into account while limiting matches to five or fewer lines of text.
\A(?>\R?\V*){0,5}\z
Regex options: None |
Regex flavors: PCRE 7 (with the PCRE_BSR_UNICODE option), Perl 5.10 |
\A(?>(?>\r\n?|[\n-\f\x85\x{2028}\x{2029}])?[^\n-\r\x85\x{2028}\x{2029}]*){0,5}\z
Regex options: None |
Regex flavors: PCRE, Perl |
\A(?>(?>\r\n?|[\n-\f\x85\u2028\u2029])?[^\n-\r\x85\u2028\u2029]*){0,5}\z
Regex options: None |
Regex flavors: .NET, Java, Ruby |
\A(?:(?:\r\n?|[\n-\f\x85\u2028\u2029])?[^\n-\r\x85\u2028\u2029]*){0,5}\Z
Regex options: None |
Regex flavor: Python |
^(?:(?:\r\n?|[\n-\f\x85\u2028\u2029])?[^\n-\r\x85\u2028\u2029]*){0,5}$
Regex options: None |
Regex flavor: JavaScript |
All of these regexes handle the line separators in Table 4-1, listed with their Unicode positions and names.
Table 4-1. Line separators
Unicode sequence | Regex equivalent | Name | When used |
---|---|---|---|
U+000D U+000A | ‹\r\n› | Carriage return and line feed (CRLF) | Windows and MS-DOS text files |
U+000A | ‹\n› | Line feed (LF) | Unix, Linux, and OS X text files |
U+000B | ‹\v› | Line tabulation (aka vertical tab, or VT) | (Rare) |
U+000C | ‹\f› | Form feed (FF) | (Rare) |
U+000D | ‹\r› | Carriage return (CR) | Mac OS text files |
U+0085 | ‹\x85› | Next line (NEL) | IBM mainframe text files (Rare) |
U+2028 | ‹\u2028› or ‹\x{2028}› | Line separator | (Rare) |
U+2029 | ‹\u2029› or ‹\x{2029}› | Paragraph separator | (Rare) |
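As a rough check of the Python variation, the sketch below compares it with the basic Python regex on a string whose lines are separated only by the rarer characters. The test strings are assumptions, and the ‹\u2028› and ‹\u2029› escapes inside the pattern rely on Python 3.3 or later.

```python
import re

# Basic version: recognizes only CRLF, CR, and LF as line breaks.
basic = re.compile(r"\A(?:(?:\r\n?|\n)?[^\r\n]*){0,5}\Z")
# Extended version: also recognizes VT, FF, NEL, LS, and PS.
extended = re.compile(
    r"\A(?:(?:\r\n?|[\n-\f\x85\u2028\u2029])?[^\n-\r\x85\u2028\u2029]*){0,5}\Z"
)

# Six lines separated by LS, VT, FF, NEL, and PS (illustrative input).
six_lines = "1\u20282\x0b3\x0c4\x855\u20296"
# The same text cut down to five lines.
five_lines = "1\u20282\x0b3\x0c4\x855"

print(bool(basic.match(six_lines)), bool(extended.match(six_lines)))
print(bool(basic.match(five_lines)), bool(extended.match(five_lines)))
```

The basic regex treats the six-line sample as a single line and accepts it, while the extended regex counts six lines and rejects it. Both accept the five-line sample.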