9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
Problem
Given a plain text string, such as a multiline value submitted via a
form, you want to convert it to an HTML fragment to display within a web
page. Paragraphs, separated by two line breaks in a row, should be
surrounded with <p>⋯</p>. Additional line breaks should be replaced with
<br> tags.
Solution
This problem can be solved in four simple steps. In most programming languages, only the middle two steps benefit from regular expressions.
Step 1: Replace HTML special characters with named character references
As we’re converting plain text to HTML, the first step is to convert the
three special HTML characters &, <, and > to named character references
(see Table 9-3). Otherwise, the resulting markup could lead to unintended
results when displayed in a web browser.
Table 9-3. HTML special character substitutions
Search for | Replace with |
---|---|
& | &amp; |
< | &lt; |
> | &gt; |
Ampersands (&) must be replaced first, since you’ll be adding more
ampersands to the subject string as part of the named character
references.
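The substitutions in Table 9-3 need no regular expression. A minimal Python sketch, assuming a hypothetical helper name `escape_html` (not from the original text), shows why the order matters:

```python
def escape_html(text: str) -> str:
    """Replace the three HTML special characters with named character references."""
    # Ampersands go first; otherwise the "&" introduced by "&lt;" and "&gt;"
    # would itself be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_html("if a < b && b > c"))
# if a &lt; b &amp;&amp; b &gt; c
```

In Python specifically, the standard library call `html.escape(text, quote=False)` performs the same three replacements, so the hand-written version is mainly useful for illustrating the ordering.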
Step 2: Replace all line breaks with <br>
Search for:
\r\n?|\n
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
\R
Regex options: None
Regex flavors: PCRE 7, Perl 5.10
Replace with:
<br>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP, Python, Ruby
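As an illustration of Step 2, the first regex can be applied with Python's `re` module. This is a sketch only; the names `LINE_BREAK` and `line_breaks_to_br` are assumptions introduced here. It uses the `\r\n?|\n` form because Python is not among the flavors listed for `\R`.

```python
import re

# Matches a CRLF pair, a lone CR, or a lone LF as a single line break.
LINE_BREAK = re.compile(r"\r\n?|\n")


def line_breaks_to_br(text: str) -> str:
    """Replace every line break, whatever its style, with a <br> tag."""
    return LINE_BREAK.sub("<br>", text)


print(line_breaks_to_br("one\r\ntwo\nthree\rfour"))
# one<br>two<br>three<br>four
```

Because `\r\n?` is tried before `\n`, a Windows-style CRLF pair is consumed as one match and yields a single `<br>` rather than two.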
Step 3: Replace double <br> tags with </p><p>
Search for:
<br>\s*<br>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, ...
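A possible Python rendering of Step 3 follows, using the replacement text `</p><p>` named in the step heading. The names `DOUBLE_BR` and `double_br_to_paragraph_break` are hypothetical, and the sketch assumes its input has already been through Step 2.

```python
import re

# Two <br> tags in a row, optionally separated by whitespace.
DOUBLE_BR = re.compile(r"<br>\s*<br>")


def double_br_to_paragraph_break(html: str) -> str:
    """Turn each pair of consecutive <br> tags into a paragraph boundary."""
    return DOUBLE_BR.sub("</p><p>", html)


print(double_br_to_paragraph_break("first<br><br>second<br>still second"))
# first</p><p>second<br>still second
```

Single `<br>` tags are left untouched, so line breaks inside a paragraph survive while blank lines become paragraph boundaries.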