2.16. Test for a Match Without Adding It to the Overall Match
Problem
Find any word that occurs between a pair of HTML bold tags, without including the tags in the regex match. For instance, if the subject is My <b>cat</b> is furry, the only valid match should be cat.
Solution
(?<=<b>)\w+(?=</b>)
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9
JavaScript and Ruby 1.8 support the lookahead ‹(?=</b>)›, but not the lookbehind ‹(?<=<b>)›.
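As a quick check of the solution, here is a minimal sketch in Python, one of the flavors listed above. The variable names are chosen for illustration only and are not part of the recipe.

```python
import re

# The Solution regex, compiled with the case-insensitive option.
bold_word = re.compile(r"(?<=<b>)\w+(?=</b>)", re.IGNORECASE)

subject = "My <b>cat</b> is furry"
match = bold_word.search(subject)

# Only the word is part of the match; the tags are tested but not included.
matched_text = match.group()
matched_span = match.span()
print(matched_text, matched_span)  # cat (6, 9)
```

Because the tags are never part of the match, the reported span covers only the three characters of the word itself.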
Discussion
Lookaround
The four kinds of lookaround groups supported by modern regex flavors have the special property of giving up the text matched by the part of the regex inside the lookaround. Essentially, lookaround checks whether certain text can be matched without actually matching it.
Lookaround that looks backward is called lookbehind. This is the only regular expression construct that will traverse the text from right to left instead of from left to right. The syntax for positive lookbehind is ‹(?<=⋯)›. The four characters ‹(?<=› form the opening bracket. What you can put inside the lookbehind, here represented by ‹⋯›, varies among regular expression flavors. But simple literal text, such as ‹(?<=<b>)›, always works.
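To illustrate how one flavor limits what may go inside a lookbehind, the sketch below uses Python's re module, which accepts only lookbehind patterns of a fixed length. The second pattern, with ‹\s+› inside the lookbehind, is a made-up example for this demonstration and does not come from the recipe.

```python
import re

# Literal text inside the lookbehind compiles without trouble.
literal_ok = re.compile(r"(?<=<b>)\w+")

# Python's re module requires the lookbehind to have a fixed length,
# so a quantifier such as + inside it is rejected at compile time.
try:
    re.compile(r"(?<=<b\s+>)\w+")
    variable_ok = True
except re.error:
    variable_ok = False

print(literal_ok.search("<b>cat").group(), variable_ok)
```

Other flavors draw the line in different places, so it is worth testing any lookbehind that goes beyond literal text in the flavor you actually use.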
Lookbehind checks to see whether the text inside the lookbehind occurs immediately to the left of the position that the regular expression engine has reached. If you match ‹(?<=<b>)› against My <b>cat</b> is furry, the lookbehind will fail to match until the regular expression starts the match attempt ...
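The behavior of the lookbehind on its own can be observed with a short Python sketch. It applies ‹(?<=<b>)› to the sample subject and records where the first successful match attempt begins; the variable names are illustrative assumptions.

```python
import re

subject = "My <b>cat</b> is furry"

# The lookbehind by itself matches a position, not any characters.
m = re.search(r"(?<=<b>)", subject)

position = m.start()                         # index where the attempt succeeds
length = len(m.group())                      # number of characters matched
following = subject[position:position + 3]   # text right after that position
print(position, length, following)  # 6 0 cat
```

The match is zero characters long, and it sits directly after the opening tag, which is the position where the rest of the full regex can then start matching the word.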