11.10. Converting HTML to ASCII


Problem

You need to convert HTML to readable, formatted ASCII text.


Solution

If you have access to an external program that formats HTML as ASCII, such as lynx, call it like so:

$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;
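
When the HTML is already in a string rather than a file, one option is to pipe it to lynx on standard input. The following is a minimal sketch, not part of the original recipe: the function name pc_lynx_dump() is hypothetical, and it assumes lynx is installed and on the PATH. The -stdin, -force_html, and -nolist options tell lynx to read from standard input, treat the input as HTML, and omit the numbered link list.

```php
<?php
// Hypothetical helper: feed an HTML string to lynx and return the text dump.
// Returns false if lynx cannot be started or exits with an error.
function pc_lynx_dump($html) {
    $spec = array(
        0 => array('pipe', 'r'),  // lynx reads the HTML from stdin
        1 => array('pipe', 'w'),  // formatted text comes back on stdout
        2 => array('pipe', 'w'),  // capture stderr so it does not leak to the page
    );
    $proc = proc_open('lynx -dump -nolist -stdin -force_html', $spec, $pipes);
    if (!is_resource($proc)) {
        return false;
    }
    fwrite($pipes[0], $html);
    fclose($pipes[0]);            // signal end of input
    $ascii = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    // A nonzero exit status (for example, lynx not found) is treated as failure.
    return proc_close($proc) === 0 ? $ascii : false;
}
```

Because nothing from the HTML string is placed on the command line, no shell escaping is needed in this variant.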


Discussion

If you can’t use an external formatter, the pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables or frames, though).

Example 11-4. pc_html2ascii( )

function pc_html2ascii($s) {
  // convert links
  $s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                    '$2 ($1)', $s);

  // convert <br>, <hr>, <p>, <div> to line breaks
  $s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
  $s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
  $s = preg_replace('@<div[^>]*>(.*?)</div>@is',"\n".'$1'."\n",$s);

  // convert bold and italic
  $s = preg_replace('@<b[^>]*>(.*?)</b>@i','*$1*',$s);
  $s = preg_replace('@<i[^>]*>(.*?)</i>@i','/$1/',$s);

  // decode named entities
  $s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));

  // decode numbered entities
  $s = preg_replace_callback('/&#(\d+);/',
                             function ($m) { return chr((int) $m[1]); }, $s);

  // remove any remaining tags
  $s = strip_tags($s);
  return $s;
}

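The two entity-decoding steps in the function can also be handled by PHP's built-in html_entity_decode( ), which understands named, decimal, and hexadecimal entities. The sketch below is an illustration rather than part of Example 11-4; the name pc_decode_entities( ) is hypothetical, and the output is assumed to be wanted as UTF-8, so characters outside ASCII come back as multibyte sequences.

```php
<?php
// Hypothetical helper: decode named and numeric entities in a single call.
function pc_decode_entities($s) {
    // ENT_QUOTES decodes both &quot; and &#039;
    // ENT_HTML5 selects the HTML5 table of named entities
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo pc_decode_entities('Fish &amp; chips &#38; peas &#x26; more'), "\n";
// prints: Fish & chips & peas & more
```

As in pc_html2ascii( ), running this before strip_tags( ) means an encoded tag such as &lt;b&gt; is decoded and then stripped; decode after stripping instead if literal angle brackets should survive.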
See Also

Recipe 9.9 for more on get_html_translation_table(); documentation on preg_replace( ) at http://www.php.net/preg-replace, get_html_translation_table( ) at http://www.php.net/get-html-translation-table, and strip_tags( ) at http://www.php.net/strip-tags.
