2.16. Test for a Match Without Adding It to the Overall Match
Problem
Find any word that occurs between a pair of HTML bold tags, without including the tags in the regex match. For instance, if the subject is My <b>cat</b> is furry, the only valid match should be cat.
Solution
(?<=<b>)\w+(?=</b>)
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9
JavaScript and Ruby 1.8 support the lookahead ‹(?=</b>)›, but not the lookbehind ‹(?<=<b>)›.
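As a quick check of the solution, here is a minimal sketch using Python's standard re module (Python is one of the flavors listed above). The names BOLD_WORD, subject, and match are illustrative and do not come from the recipe.

```python
import re

# The recipe's regex; re.IGNORECASE mirrors "Regex options: Case insensitive".
BOLD_WORD = re.compile(r"(?<=<b>)\w+(?=</b>)", re.IGNORECASE)

subject = "My <b>cat</b> is furry"
match = BOLD_WORD.search(subject)

# The overall match holds only the word; the tags are tested but not included.
print(match.group())  # cat
print(match.span())   # (6, 9)
```

The span covers only the three characters of the word, even though the regex also had to find the tags on either side of it.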
Discussion
Lookaround
The four kinds of lookaround groups supported by modern regex flavors have the special property of giving up the text matched by the part of the regex inside the lookaround. Essentially, lookaround checks whether certain text can be matched without actually matching it.
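The difference between matching text and merely testing for it can be seen by comparing an ordinary token with a lookahead. This is a small sketch assuming Python's re module; the variable names consumed and tested are made up for the illustration.

```python
import re

subject = "My <b>cat</b> is furry"

# Ordinary tokens consume what they match, so the closing tag is part of the match.
consumed = re.search(r"cat</b>", subject)

# A positive lookahead only tests for the closing tag; the match stops before it.
tested = re.search(r"cat(?=</b>)", subject)

print(consumed.group(), consumed.end())  # cat</b> 13
print(tested.group(), tested.end())      # cat 9
```

Both regexes require the closing tag to be present, but only the first one adds it to the overall match.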
Lookaround that looks backward is called lookbehind. This is the only regular expression construct that will traverse the text from right to left instead of from left to right. The syntax for positive lookbehind is ‹(?<=text)›. The four characters ‹(?<=› form the opening bracket. What you can put inside the lookbehind, here represented by ‹text›, varies among regular expression flavors. But simple literal text, such as ‹(?<=<b>)›, always works.
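To illustrate how the contents allowed inside a lookbehind depend on the flavor, the sketch below uses Python's re module, which accepts only lookbehind of a fixed length. The second pattern, with alternatives of different lengths, is a hypothetical example and not part of the recipe; other flavors may well accept it.

```python
import re

# Fixed-length literal text inside the lookbehind compiles without complaint.
fixed = re.compile(r"(?<=<b>)\w+")

# Alternatives of different lengths are rejected by Python's re module.
try:
    re.compile(r"(?<=<b>|<strong>)\w+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(fixed.search("<b>cat</b>").group())  # cat
print(variable_width_ok)                   # False
```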
Lookbehind checks to see whether the text inside the lookbehind occurs immediately to the left of the position that the regular expression engine has reached. If you match ‹(?<=<b>)› against My <b>cat</b> is furry, the lookbehind will fail to match until the regular expression starts the match ...
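The position check can be observed directly by running the lookbehind on its own. This is a rough sketch assuming Python's re module; the name positions is illustrative. Because the lookbehind matches no text itself, each match is a zero-length match that marks a position in the subject.

```python
import re

subject = "My <b>cat</b> is furry"

# A lookbehind by itself produces zero-length matches: positions, not text.
positions = [m.start() for m in re.finditer(r"(?<=<b>)", subject)]

print(positions)               # [6]
print(subject[positions[0]:])  # cat</b> is furry
```

In this subject, index 6 is the only position with <b> immediately to its left, so the lookbehind fails everywhere else.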