9.3. Remove All XML-Style Tags Except <em> and <strong>
Problem
You want to remove all tags in a string except <em> and <strong>.
In a separate case, you want to remove not only all tags other than <em> and <strong>, but also any <em> and <strong> tags that contain attributes.
Solution
This is a perfect setting to put negative lookahead (explained in Recipe 2.16) to use. Applied to this problem, negative lookahead lets you match what looks like a tag, except when certain words come immediately after the opening < or </. If you then replace all matches with an empty string (following the code in Recipe 3.14), only the approved tags are left behind.
Solution 1: Match tags except <em> and <strong>
</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
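As a minimal sketch of how this regex might be applied, the following Python snippet replaces every match with an empty string. The names TAG_EXCEPT_EM_STRONG and strip_tags, and the sample string, are illustrative assumptions and are not taken from the recipe.

```python
import re

# Solution 1 regex: matches any tag whose name is not exactly "em" or "strong".
# The constant name is hypothetical, chosen for this example only.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags(text: str) -> str:
    """Remove every matched tag by replacing it with the empty string."""
    return TAG_EXCEPT_EM_STRONG.sub("", text)


sample = '<p class="x">Keep <em>this</em>, drop <a href="#">links</a>.</p>'
print(strip_tags(sample))  # Keep <em>this</em>, drop links.
```

Because of the word boundary inside the lookahead, a tag such as <embed> is still removed: its name merely begins with "em", so the lookahead does not block the match.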
In free-spacing mode:
<
/?                      # Permit closing tags
(?!
    (?: em | strong )   # List of tags to avoid matching
    \b                  # Word boundary avoids partial word matches
)
[a-z]                   # Tag name initial character must be a-z
(?: [^>"']              # Any character except >, ", or '
  | "[^"]*"             # Double-quoted attribute value
  | '[^']*'             # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
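In Python, free-spacing mode corresponds to the re.VERBOSE flag. The sketch below shows one way the commented regex could be compiled and used. The name TAG_VERBOSE and the sample input are assumptions made for illustration.

```python
import re

# Free-spacing version of the Solution 1 regex, compiled with re.VERBOSE.
# Whitespace outside character classes is ignored, and "#" starts a comment.
TAG_VERBOSE = re.compile(
    r"""
    <
    /?                      # Permit closing tags
    (?!
        (?: em | strong )   # List of tags to avoid matching
        \b                  # Word boundary avoids partial word matches
    )
    [a-z]                   # Tag name initial character must be a-z
    (?: [^>"']              # Any character except >, ", or '
      | "[^"]*"             # Double-quoted attribute value
      | '[^']*'             # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

# The quoted attribute value contains ">", yet the whole <div ...> tag is matched.
print(TAG_VERBOSE.sub("", '<div id="a>b">x <EM>y</EM></div>'))  # x <EM>y</EM>
```

The quoted-value alternatives are what allow a > inside an attribute value to be skipped rather than treated as the end of the tag.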
Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes
With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any <em> and <strong> tags that contain ...
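Applying the substitution described above to the Solution 1 regex gives the pattern used in the following sketch. The regex here is derived from that description rather than copied from the recipe, and the names TAG_EXCEPT_BARE_EM_STRONG and strip_tags_and_attributed are hypothetical.

```python
import re

# Solution 1 with \b swapped for \s*> inside the lookahead, as the text describes.
# Only <em> and <strong> tags with nothing but optional whitespace before ">" are spared.
TAG_EXCEPT_BARE_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags_and_attributed(text: str) -> str:
    """Remove unapproved tags, plus <em>/<strong> tags that carry attributes."""
    return TAG_EXCEPT_BARE_EM_STRONG.sub("", text)


print(strip_tags_and_attributed('<em>a</em> <strong >b</strong>'))
# <em>a</em> <strong >b</strong>
print(strip_tags_and_attributed('<em class="x">a</em>'))
# a</em>
```

In the second call, only the opening tag carries an attribute, so only it is removed. The closing </em> has no attributes and is left in place.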