Greedy and Non-Greedy Matches
Problem
You have a pattern with a greedy quantifier like *, +, ?, or {}, and you want to stop it from being greedy.
A classic case of this is the naïve substitution to remove tags from HTML. Although it looks appealing, s#<TT>.*</TT>##gsi actually deletes everything from the first opening TT tag through the last closing one. This would turn "Even <TT>vi</TT> can edit <TT>troff</TT> effectively." into "Even  effectively.", completely changing the meaning of the sentence!
Solution
Replace the offending greedy quantifier with the corresponding non-greedy version. That is, change *, +, ?, and {} into *?, +?, ??, and {}?, respectively.
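As a minimal sketch of that change, the following applies both forms of the substitution from the Problem section to the same sentence. The variable names ($html, $greedy, $minimal) are illustrative only and do not come from the recipe.

```perl
use strict;
use warnings;

# Sample sentence from the Problem section.
my $html = "Even <TT>vi</TT> can edit <TT>troff</TT> effectively.";

# Greedy: .* runs from the first <TT> all the way to the last </TT>.
(my $greedy = $html) =~ s#<TT>.*</TT>##gsi;

# Non-greedy: .*? stops at the nearest </TT>, so each pair is removed separately.
(my $minimal = $html) =~ s#<TT>.*?</TT>##gsi;

print "greedy:  [$greedy]\n";    # greedy:  [Even  effectively.]
print "minimal: [$minimal]\n";   # minimal: [Even  can edit  effectively.]
```

Only the tagged text is removed, so the spaces on either side of each tag pair remain, which is why doubled spaces appear in the output.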
Discussion
Perl has two sets of quantifiers: the maximal ones *, +, ?, and {} (sometimes called greedy) and the minimal ones *?, +?, ??, and {}? (sometimes called stingy). For instance, given the string "Perl is a Swiss Army Chainsaw!", the pattern /(r.*s)/ matches "rl is a Swiss Army Chains" whereas /(r.*?s)/ matches "rl is".
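The comparison above can be checked with a short script. This is only a sketch; the variable names ($string, $max, $min) are made up for illustration.

```perl
use strict;
use warnings;

my $string = "Perl is a Swiss Army Chainsaw!";

# A match in list context returns the captured group.
my ($max) = $string =~ /(r.*s)/;    # maximal: extends to the last "s"
my ($min) = $string =~ /(r.*?s)/;   # minimal: stops at the first "s" after the "r"

print "maximal: [$max]\n";   # maximal: [rl is a Swiss Army Chains]
print "minimal: [$min]\n";   # minimal: [rl is]
```

Both matches begin at the same place, the first "r" in the string; the quantifier only changes where the match ends.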
With maximal quantifiers, when you ask to match a variable number of times, such as zero or more times for * or one or more times for +, the matching engine prefers the “or more” portion of that description. Thus /foo.*bar/ matches from the first "foo" up to the last "bar" in the string, rather than merely the next "bar", as some might expect. To make any of the regular expression repetition operators prefer stingy matching over greedy matching, add an extra ?. So *? matches zero or more times, but rather ...