11.10. Converting HTML to ASCII
Problem
You need to convert HTML to readable, formatted ASCII text.
Solution
If you have access to an external program that formats HTML as ASCII, such as lynx, call it like so:
$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;
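When the HTML is already in a string rather than a file, one option is to write it to a temporary file and hand that file to the formatter. The following is a minimal sketch, assuming lynx is installed and on the PATH. The helper name html_string_to_ascii( ) and its $command parameter are hypothetical, introduced only for illustration. The -force_html flag asks lynx to treat the temporary file as HTML even though it has no .html extension.

```php
<?php
// Hypothetical helper (not part of the original recipe): run an HTML
// string through an external formatter and return its text output.
// $command is an assumption; it defaults to a lynx invocation.
function html_string_to_ascii($html, $command = 'lynx -dump -force_html')
{
    // Write the HTML to a temporary file so the formatter can read it.
    $tmp = tempnam(sys_get_temp_dir(), 'h2a');
    if ($tmp === false) {
        return false;
    }
    file_put_contents($tmp, $html);

    // Escape the filename before passing it to the shell.
    $ascii = shell_exec($command . ' ' . escapeshellarg($tmp));
    unlink($tmp);

    // shell_exec() returns null when the command fails or prints nothing.
    return ($ascii === null || $ascii === false) ? false : $ascii;
}
```

Calling html_string_to_ascii($html) returns the formatted text, or false if the formatter could not be run, so the caller can fall back to a pure-PHP conversion.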
Discussion
If you can’t use an external formatter, the pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables or frames, though).
Example 11-4. pc_html2ascii( )
function pc_html2ascii($s) {
    // convert links to "text (URL)"
    $s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                      '$2 ($1)', $s);

    // convert <br>, <hr>, <p>, <div> to line breaks
    $s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
    $s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
    $s = preg_replace('@<div[^>]*>(.*)</div>@i',"\n".'$1'."\n",$s);

    // convert bold and italic
    $s = preg_replace('@<b[^>]*>(.*?)</b>@i','*$1*',$s);
    $s = preg_replace('@<i[^>]*>(.*?)</i>@i','/$1/',$s);

    // decode named entities
    $s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));

    // decode numbered entities (a callback is used here because the
    // /e modifier was removed in PHP 7)
    $s = preg_replace_callback('/&#(\d+);/',
                               function ($m) { return chr((int) $m[1]); },
                               $s);

    // remove any remaining tags
    $s = strip_tags($s);

    return $s;
}
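The two entity-decoding steps can also be sketched with PHP's built-in html_entity_decode( ), which handles named, decimal, and hexadecimal entities in one call. This is a minimal illustration, assuming PHP 5.4 or later (for ENT_HTML5) and UTF-8 output; the sample string is made up. Note that the result can contain non-ASCII characters such as é, so it may need further transliteration if strict ASCII is required.

```php
<?php
// Sample input, made up for illustration.
$s = 'Caf&eacute; &amp; bar &#65;&#x42; &lt;ok&gt;';

// Decode named, decimal, and hexadecimal entities in a single pass.
$decoded = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";
```

As in pc_html2ascii( ), decoding should happen after links and formatting tags are converted, so that a decoded &lt; is not mistaken for the start of a tag by those patterns.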
See Also
Recipe 9.9 for more on get_html_translation_table( ); documentation on preg_replace( ) at http://www.php.net/preg-replace, get_html_translation_table( ) at http://www.php.net/get-html-translation-table, and strip_tags( ) at http://www.php.net/strip-tags.