Chapter 9. HTML Processing with Trees
Treating HTML as a stream of tokens is an imperfect solution to the problem of extracting information from HTML. In particular, the token model obscures the hierarchical nature of markup. Nested structures such as lists within lists or tables within tables are difficult to process as just tokens. Such structures are best represented as trees, and the HTML::Element class does just this.
This chapter teaches you how to use the HTML::TreeBuilder module to construct trees from HTML, and how to process those trees to extract information. Chapter 10 shows how to modify HTML using trees.
Introduction to Trees
The HTML in Example 9-1 can be represented by the tree in Figure 9-1.
<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>
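As a rough illustration of how such a fragment might be turned into a tree, the following minimal sketch feeds the HTML above to HTML::TreeBuilder and prints an outline of the result. It assumes that HTML::TreeBuilder is installed from CPAN; the variable names ($html, $tree, @items, $count) are chosen here purely for illustration. Because the parser supplies implied elements by default, the outline should also show html, head, and body nodes that do not appear in the source text.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The HTML fragment from the example above. Note that the second
# <li> has no end-tag; the parser is expected to close it implicitly.
my $html = <<'END_HTML';
<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);   # feed the markup to the parser
$tree->eof;            # signal end of input so the tree is completed

$tree->dump;           # print an indented outline of the tree to STDOUT

# Collect every li element anywhere in the tree.
my @items = $tree->look_down(_tag => 'li');
my $count = scalar @items;
print "Found $count li elements\n";

$tree->delete;         # release the tree when finished with it
```

The call to dump is a convenient way to compare what the parser built against a hand-drawn tree such as the one in the figure.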
In the language of trees, each part of the tree (such as html, li, "Ice cream.", and br) is a node. There are two kinds of nodes in an HTML tree: text nodes, which are strings with no tags, and elements, which symbolize not mere strings, but things that can have attributes (such as align=left), and which generally came from an open tag (such as <li>), and were possibly closed by an end-tag (such as </li>).
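The distinction between the two kinds of nodes can be sketched in code. In a tree built by HTML::TreeBuilder, a text node is a plain Perl string, while an element is an HTML::Element object, so ref can tell them apart. The snippet below is only a minimal sketch under that assumption; the one-line HTML input, the align="left" attribute on the li, and the names $li and @kinds are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: one list item holding text, a br element, and more text.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<ul><li align="left">Hot apple pie <br>(mmm pie)</li></ul>');
$tree->eof;

# In list context, look_down returns all matches; keep the first li.
my ($li) = $tree->look_down(_tag => 'li');

my @kinds;
for my $node ($li->content_list) {
    if (ref $node) {
        # An element: an object with a tag name and possibly attributes.
        push @kinds, 'element:' . $node->tag;
    }
    else {
        # A text node: just a string, with no tags of its own.
        push @kinds, 'text';
    }
}

print join(', ', @kinds), "\n";

# Elements can carry attributes; text nodes cannot.
my $align = $li->attr('align');
print "li align attribute: $align\n";

$tree->delete;
```

Checking ref before calling a method matters here, since calling a method such as tag on a plain string would be an error.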
When several nodes are contained by another, as the li elements are contained by the ul element, the contained ones are called children. Children ...