Perl & LWP

By Sean M. Burke



Table of Contents

Chapter 1: Introduction to Web Automation
LWP (short for "Library for World Wide Web in Perl") is a set of Perl modules and object-oriented classes for getting data from the Web and for extracting information from HTML. This chapter provides essential background on the LWP suite. It describes the nature and history of LWP, which platforms it runs on, and how to download and install it. This chapter ends with a quick walkthrough of several LWP programs that illustrate common tasks, such as fetching web pages, extracting information using regular expressions, and submitting forms.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Web as Data Source
Most web sites are designed for people. User Interface gurus consult for large sums of money to build HTML code that is easy to use and displays correctly on all browsers. User Experience gurus wag their fingers and tell web designers to study their users, so they know the human foibles and desires of the ape descendants who will be viewing the web site.
Fundamentally, though, a web site is home to data and services. A stockbroker has stock prices and the value of your portfolio (data) and forms that let you buy and sell stock (services). Amazon has book ISBNs, titles, authors, reviews, prices, and rankings (data) and forms that let you order those books (services).
It's assumed that the data and services will be accessed by people viewing the rendered HTML. But many a programmer has eyed those data sources and services on the Web and thought "I'd like to use those in a program!" For example, such a program could page you when your portfolio falls past a certain point or could calculate the "best" book on Perl based on the ratio of its price to its average reader review.
LWP lets you do this kind of web automation. With it, you can fetch web pages, submit forms, authenticate, and extract information from HTML. Once you've used it to grab news headlines or check links, you'll never view the Web in the same way again.
As with everything in Perl, there's more than one way to automate accessing the Web. In this book, we'll show you everything from the basic way to access the Web (via the LWP::Simple module), through forms, all the way to the gory details of cookies, authentication, and other types of complex requests.
Once you've tackled the fundamentals of how to ask a web server for a particular page, you still have to find the information you want, buried in the HTML response. Most often you won't need more than regular expressions to achieve this. Chapter 6 describes the art of extracting information from HTML using regular expressions, although you'll see the beginnings of it as early as Chapter 2, where we query AltaVista for a word, and use a regexp to match the number in the response that says "We found
Additional content appearing in this section has been removed.
History of LWP
The following history of LWP was written by Gisle Aas, one of the creators of LWP and its current maintainer.
The libwww-perl project was started at the very first WWW conference held in Geneva in 1994. At the conference, Martijn Koster met Roy Fielding who was presenting the work he had done on MOMspider. MOMspider was a Perl program that traversed the Web looking for broken links and built an index of the documents and links discovered. Martijn suggested turning the reusable components of this program into a library. The result was the libwww-perl library for Perl 4 that Roy maintained.
Later the same year, Larry Wall made the first "stable" release of Perl 5 available. It was obvious that the module system and object-oriented features that the new version of Perl provided could make Roy's library even better. At one point, both Martijn and I had made our own separate modifications of libwww-perl. We joined forces, merged our designs, and made several alpha releases. Unfortunately, Martijn ended up in disagreement with his employer about the intellectual property rights of work done outside hours. To safeguard the code's continued availability to the Perl community, he asked me to take over maintenance of it.
The LWP:: module namespace was introduced by Martijn in one of the early alpha releases. This name choice was the subject of lively discussion on the libwww mailing list. It was soon pointed out that this name could be confused with what certain implementations of threads called themselves, but no better name alternatives emerged. In the last message on this matter, Martijn concluded, "OK, so we all agree LWP stinks :-)." The name stuck and has established itself.
If you search for "LWP" on Google today, you have to go to 30th position before you find a link about threads.
In May 1996, we made the first non-beta release of libwww-perl for Perl 5. It was called release 5.00 because it was for Perl 5. This made some room for Roy to maintain libwww-perl for Perl 4, called libwww-perl-0.40. Martijn continued to contribute but was unfortunately "rolled over by the Java train."
Additional content appearing in this section has been removed.
Installing LWP
LWP and the associated modules are available in various distributions free from the Comprehensive Perl Archive Network (CPAN). The main distributions are listed at the start of Appendix A, although the details of which modules are in which distributions change occasionally.
If you're using ActivePerl for Windows or MacPerl for Mac OS 9, you already have LWP. If you're on Unix and you don't already have LWP installed, you'll need to install it from CPAN using instructions given in the next section.
To test whether you already have LWP installed:
% perl -MLWP -le "print(LWP->VERSION)"
(The second character in -le is a lowercase L, not a digit one.)
If you see:
Can't locate LWP in @INC (@INC contains: ...lots of paths...).
BEGIN failed--compilation aborted.
or if you see a version number lower than 5.64, you need to install LWP on your system.
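The same check can be made from inside a program. What follows is a minimal sketch rather than anything from the LWP distribution: the helper name have_module is hypothetical, and it relies only on Perl's built-in require and on the VERSION method that every package inherits.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $module loads and is at least $min_version.
sub have_module {
    my ($module, $min_version) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # LWP::Simple -> LWP/Simple.pm
    return eval {
        require $file;                        # dies if the module is not in @INC
        $module->VERSION($min_version);       # dies if the version is too low
        1;
    } ? 1 : 0;
}

if (have_module('LWP', 5.64)) {
    print "LWP ", LWP->VERSION, " is installed.\n";
} else {
    print "LWP is missing or older than 5.64; install it from CPAN.\n";
}
```

Wrapping both steps in eval means a missing module and an outdated module are reported the same way, which matches the advice above: in either case, install LWP.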
There are two ways to install modules: using the CPAN shell or the old-fashioned manual way.
The CPAN shell is a command-line environment for automatically downloading, building, and installing modules from CPAN.

Section 1.3.1.1: Configuring

If you have never used the CPAN shell, you will need to configure it before you can use it. It will prompt you for some information before building its configuration file.
Invoke the CPAN shell by entering the following command at a system shell prompt:
% perl -MCPAN -eshell
If you've never run it before, you'll see this:
We have to reconfigure CPAN.pm due to following uninitialized parameters:
followed by a number of questions. For each question, the default answer is typically fine, but you may answer otherwise if you know that the default setting is wrong or not optimal. Once you've answered all the questions, a configuration file is created and you can start working with the CPAN shell.

Section 1.3.1.2: Obtaining help

If you need help at any time, you can read the CPAN shell's manual page by typing perldoc CPAN or by starting up the CPAN shell (with perl -MCPAN -eshell at a system shell prompt) and entering
Additional content appearing in this section has been removed.
Words of Caution
In theory, the underlying mechanisms of the Web make no distinction between a browser getting data and displaying it to you, and your LWP-based program getting data and doing something else with it. However, in practice, almost all the data on the Web was put there with the assumption (sometimes implicit, sometimes explicit) that it would be looked at directly in a browser. When you write an LWP program that downloads that data, you are working against that assumption. The trick is to do this in as considerate a way as possible.
When you access a web server, you are using scarce resources. You are using your bandwidth and the web server's bandwidth. Moreover, processing your request places a load on the remote server, particularly if the page you're requesting has to be dynamically generated, and especially if that dynamic generation involves database access. If you're writing a program that requests several pages from a given server but you don't need the pages immediately, you should write delays into your program (such as sleep 60; to sleep for one minute), so that the load that you're placing on the network and on the web server is spread unobtrusively over a longer period of time.
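One way to build such delays into a program is sketched below. The name fetch_politely and its arguments are hypothetical; the fetching itself is passed in as a code reference, so the sketch runs here with a stand-in that never touches the network, and the example.com URLs are placeholders.

```perl
use strict;
use warnings;

# Fetch each URL in turn, pausing $delay seconds between requests.
# $fetch is any code reference that takes a URL and returns its content.
sub fetch_politely {
    my ($urls, $delay, $fetch) = @_;
    my %content_for;
    my $first = 1;
    for my $url (@$urls) {
        sleep $delay unless $first;   # no need to wait before the first request
        $first = 0;
        $content_for{$url} = $fetch->($url);
    }
    return \%content_for;
}

# Stand-in fetcher, an assumption for illustration only.
my $fake_fetch = sub { my ($url) = @_; return "content of $url" };

my $pages = fetch_politely(
    [ 'http://www.example.com/a', 'http://www.example.com/b' ],
    1,             # seconds; for real use, something like 60 is kinder
    $fake_fetch,
);
print "$_ => $pages->{$_}\n" for sort keys %$pages;
```

In a real program you would pass a reference to a function that actually fetches pages, such as LWP::Simple's get( ), in place of the stand-in.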
If possible, you might even want to consider having your program run in the middle of the night (modulo the relevant time zones), when network usage is low and the web server is not likely to be busy handling a lot of requests. Do this only if you know there is no risk of your program behaving unpredictably. In Chapter 12, we discuss programs with a definite risk of that happening; do not let such programs run unattended until you have added appropriate safeguards and carefully checked that they behave as you expect them to.
While the complexities of national and international copyright law can't be covered in a page or two (or even a library or two), the short story is that just because you can get some data off the Web doesn't mean you can do whatever you want with it. The things you do with data on the Web form a continuum, as far as their relation to copyright law. At the one end is direct use, where you sit at your browser, downloading and reading pages as the site owners clearly intended. At the other end is illegal use, where you run a program that hammers a remote server as it copies and saves copyrighted data that was not meant for free public consumption, then saves it all to your public web server, which you then encourage people to visit so that you can make money off of the ad banners you've put there. Between these extremes, there are many gray areas involving considerations of "fair use," a tricky concept. The safest guide in trying to stay on the right side of copyright law is to ask, by using the data this way, could I possibly be depriving the original web site of some money that it would/could otherwise get?
Additional content appearing in this section has been removed.
LWP in Action
Enough of why you should be careful when you automate the Web. Let's look at the types of things you'll be learning in this book. Chapter 2 introduces web automation and LWP, presenting straightforward functions to let you fetch web pages. Example 1-1 shows how to fetch the O'Reilly catalog page and count the number of times Perl is mentioned.
Example 1-1. Count "Perl" in the O'Reilly catalog
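#!/usr/bin/perl -w
use strict;
use LWP::Simple;
  
my $catalog = get("http://www.oreilly.com/catalog");
# get( ) returns undef on failure, so stop here rather than match against it
die "Couldn't fetch the catalog\n" unless defined $catalog;
my $count = 0;
$count++ while $catalog =~ m{Perl}gi;
print "$count\n";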
The LWP::Simple module's get( ) function returns the document at a given URL or undef if an error occurred. A regular expression match in a loop counts the number of occurrences.
Chapter 3 goes beyond LWP::Simple to show LWP's larger, more powerful object-oriented interface. Most useful of all the features it covers are how to set headers in requests and check the headers of responses. Example 1-2 prints the identifying string that a server returns in its Server header.
Example 1-2. Identify a server
#!/usr/bin/perl -w
use strict;
use LWP;
  
my $browser = LWP::UserAgent->new(  );
my $response = $browser->get("http://www.oreilly.com/");
print $response->header("Server"), "\n";
The two variables, $browser and $response, are references to objects. The LWP::UserAgent object $browser makes requests of a server and creates HTTP::Response objects such as $response to represent the server's reply. In Example 1-2, we call the header( ) method on the response to check one of the HTTP header values.
Chapter 5 shows how to analyze and submit forms with LWP, including both GET and POST submissions. Example 1-3 makes queries of the California license plate database to see whether a personalized plate is available.
Example 1-3. Query California license plate database
#!/usr/bin/perl -w
# pl8.pl -  query California license plate database
 
use strict;
use LWP::UserAgent;
my $plate = $ARGV[0] || die "Plate to search for?\n";
$plate = uc $plate;
$plate =~ tr/O/0/;  # we use zero for letter-oh
die "$plate is invalid.\n"
 unless $plate =~ m/^[A-Z0-9]{2,7}$/
    and $plate !~ m/^\d+$/;  # no all-digit plates
 
my $browser = LWP::UserAgent->new;
my $response = $browser->post(
  'http://plates.ca.gov/search/search.php3',
  [
    'plate'  => $plate,
    'search' => 'Check Plate Availability'
  ],
);
die "Error: ", $response->status_line
 unless $response->is_success;
 
if($response->content =~ m/is unavailable/) {
  print "$plate is already taken.\n";
} elsif($response->content =~ m/and available/) {
  print "$plate is AVAILABLE!\n";
} else {
  print "$plate... Can't make sense of response?!\n";
}
exit;
Additional content appearing in this section has been removed.
Chapter 2: Web Basics
Three things made the Web possible: HTML for encoding documents, HTTP for transferring them, and URLs for identifying them. To fetch and extract information from web pages, you must know all three—you construct a URL for the page you wish to fetch, make an HTTP request for it and decode the HTTP response, then parse the HTML to extract information. This chapter covers the construction of URLs and the concepts behind HTTP. HTML parsing is tricky and gets its own chapters later, as does the module that lets you manipulate URLs.
You'll also learn how to automate the most basic web tasks with the LWP::Simple module. As its name suggests, this module has a very simple interface. You'll learn the limitations of that interface and see how to use other LWP modules to fetch web pages without the limitations of LWP::Simple.
Additional content appearing in this section has been removed.
URLs
A Uniform Resource Locator (URL) is the address of something on the Web. For example:
http://www.oreilly.com/news/bikeweek_day1.html
URLs have a structure, given in RFC 2396. That RFC runs to 40 pages, largely because of the wide variety of things for which you can construct URLs. Because we are interested only in HTTP and FTP URLs, the components of a URL, with the delimiters that separate them, are:
scheme://username@server:port/path?query
In the case of our example URL, the scheme is http, the server is www.oreilly.com, and the path is /news/bikeweek_day1.html.
This is an FTP URL:
ftp://ftp.is.co.za/rfc/rfc1808.txt
The scheme is ftp, the host is ftp.is.co.za, and the path is /rfc/rfc1808.txt. The scheme and the hostname are not case sensitive, but the rest is. That is, ftp://ftp.is.co.za/rfc/rfc1808.txt and fTp://ftp.Is.cO.ZA/rfc/rfc1808.txt are the same, but ftp://ftp.is.co.za/rfc/rfc1808.txt and ftp://ftp.is.co.za/rfc/RFC1808.txt are not, unless that server happens to forgive case differences in requests.
We're ignoring the URLs that don't designate things that a web client can retrieve. For example, telnet://melvyl.ucop.edu/ designates a host with which you can start a Telnet session, and mailto:mojo@jojo.int designates an email address to which you can send.
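To make the component structure concrete, here is a rough sketch that pulls an HTTP or FTP URL apart with two regular expressions. The function name split_url and the hash keys are our own choices for illustration; the first pattern is adapted from the one given in an appendix to RFC 2396. It makes no attempt to validate the URL.

```perl
use strict;
use warnings;

# Split an HTTP or FTP URL into the components named in the text.
# A sketch only: it validates nothing and ignores fragments.
sub split_url {
    my ($url) = @_;
    my ($scheme, $authority, $path, $query) = $url =~
        m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?};
    my ($user, $host, $port);
    ($user, $host, $port) = $authority =~ m{^(?:([^@]*)@)?([^:]*)(?::(\d*))?$}
        if defined $authority;
    return {
        scheme   => defined($scheme) ? lc($scheme) : undef,  # not case sensitive
        username => $user,
        server   => defined($host) ? lc($host) : undef,      # not case sensitive
        port     => $port,
        path     => $path,                                    # case sensitive
        query    => $query,
    };
}

my $parts = split_url('fTp://ftp.Is.cO.ZA/rfc/rfc1808.txt');
print "$parts->{scheme} $parts->{server} $parts->{path}\n";
# prints: ftp ftp.is.co.za /rfc/rfc1808.txt
```

Lowercasing only the scheme and the server reflects the rule above: those two parts are not case sensitive, but the path must be left exactly as written.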
The only characters allowed in the path portions of a URL are the US-ASCII characters A through Z, a through z, and 0 through 9 (but excluding extended ASCII characters such as ü and Unicode characters such as Ω), and these permitted punctuation characters:
-     _     .     !     ~     *     '     ,
:     @     &     +     $     (     )     /
For a query component, the same rule holds, except that the only punctuation characters allowed are these:
-     _     .     !     ~     *     '     (     )
Any other characters must be URL encoded, i.e., expressed as a percent sign followed by the two hexadecimal digits for that character. So if you wanted to use a space in a URL, it would have to be expressed as %20, because space is character 32 in ASCII, and the number 32 expressed in hexadecimal is 20.
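The encoding rule is short enough to sketch in a few lines. The function name url_encode is ours, and the sketch uses the stricter query-component punctuation list. Converting characters beyond ASCII to UTF-8 bytes before encoding them is an assumption on our part; the rule above does not say which byte encoding to use.

```perl
use strict;
use warnings;

# Percent-encode everything except letters, digits, and - _ . ! ~ * ' ( )
sub url_encode {
    my ($text) = @_;
    # Assumption: wide characters are sent as UTF-8 bytes.
    utf8::encode($text) if utf8::is_utf8($text);
    $text =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

print url_encode("I like pie"), "\n";   # prints: I%20like%20pie
```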
Additional content appearing in this section has been removed.
An HTTP Transaction
The Hypertext Transfer Protocol (HTTP) is used to fetch most documents on the Web. It is formally specified in RFC 2616, but this section explains everything you need to know to use LWP.
HTTP is a client/server protocol: the server has the file, and the client wants it. In regular web surfing, the client is a web browser such as Mozilla or Internet Explorer. The URL for a document identifies the server, which the browser contacts to request the document. The server responds with either an error ("file not found") or success (in which case the document is attached).
Example 2-1 contains a sample request from a client.
Example 2-1. An HTTP request
GET /daily/2001/01/05/1.html HTTP/1.1
Host: www.suck.com
User-Agent: SuperDuperBrowser/14.6
blank line
            
A successful response is given in Example 2-2.
Example 2-2. A successful HTTP response
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 24204
blank line
               and then 24,204 bytes of HTML code
            
A response indicating failure is given in Example 2-3.
Example 2-3. An unsuccessful HTTP response
HTTP/1.1 404 Not Found
Content-type: text/html
Content-length: 135
  
<html><head><title>Not Found</title></head><body>
Sorry, the object you requested was not found.
</body></html>
and then the server closes the connection
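LWP does this parsing for you, but it can help to see how little is involved. The sketch below splits the raw text of a response like the ones above into its status code, reason phrase, headers, and body. The name parse_response and the shape of the returned hash are hypothetical, and the body in the sample is abbreviated.

```perl
use strict;
use warnings;

# Split a raw HTTP response into status code, reason, headers, and body.
sub parse_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;   # blank line ends the headers
    my ($status_line, @header_lines) = split /\r?\n/, $head;
    my ($version, $code, $reason) =
        $status_line =~ m{^(HTTP/\d+\.\d+) (\d{3}) (.*)$};
    my %header;
    for my $line (@header_lines) {
        my ($key, $value) = $line =~ /^([^:]+):\s*(.*)$/ or next;
        $header{lc $key} = $value;    # header names are not case sensitive
    }
    return {
        version => $version,
        code    => $code,
        reason  => $reason,
        header  => \%header,
        body    => defined($body) ? $body : '',
    };
}

my $raw = join "\r\n",
    'HTTP/1.1 404 Not Found',
    'Content-type: text/html',
    'Content-length: 135',
    '',
    '<html>...</html>';
my $resp = parse_response($raw);
print "$resp->{code} $resp->{reason} ($resp->{header}{'content-type'})\n";
# prints: 404 Not Found (text/html)
```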
            
An HTTP request has three parts: the request line, the headers, and the body of the request (normally used to pass form parameters).
The request line says what the client wants to do (the method), what it wants to do it to (the path), and what protocol it's speaking. Although the HTTP standard defines several methods, the most common are GET and POST. The path is part of the URL being requested (in Example 2-1 the path is /daily/2001/01/05/1.html). The protocol version is generally HTTP/1.1.
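Assembling those parts by hand shows how they fit together. This is only a sketch, since LWP builds requests for you; the name build_request and its argument order are hypothetical. Lines are joined with a carriage return and line feed, as HTTP requires.

```perl
use strict;
use warnings;

# Assemble the text of an HTTP/1.1 request from its three parts.
sub build_request {
    my ($method, $path, $headers, $body) = @_;
    $body = '' unless defined $body;
    my @lines = ("$method $path HTTP/1.1");          # the request line
    my @pairs = @$headers;                           # copy of (key, value, ...)
    while (my ($key, $value) = splice @pairs, 0, 2) {
        push @lines, "$key: $value";                 # one header per line
    }
    return join("\r\n", @lines, '', '') . $body;     # blank line, then the body
}

print build_request('GET', '/daily/2001/01/05/1.html', [
    'Host'       => 'www.suck.com',
    'User-Agent' => 'SuperDuperBrowser/14.6',
]);
```

A GET request has no body, so the text ends with the blank line that follows the headers.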
Each header line consists of a key and a value (for example, User-Agent: SuperDuperBrowser/14.6). In versions of HTTP previous to 1.1, header lines were optional. In HTTP 1.1, the Host: header must be present, to name the server to which the browser is talking. This is the "server" part of the URL being requested (e.g.,
Additional content appearing in this section has been removed.
LWP::Simple
GET is the simplest and most common type of HTTP request. Form parameters may be supplied in the URL, but there is never a body to the request. The LWP::Simple module has several functions for quickly fetching a document with a GET request. Some functions return the document; others save or print it.
The LWP::Simple module's get( ) function takes a URL and returns the body of the document:
$document = get("http://www.suck.com/daily/2001/01/05/1.html");
If the document can't be fetched, get( ) returns undef. Incidentally, if LWP requests that URL and the server replies that it has moved to some other URL, LWP requests that other URL and returns that.
With LWP::Simple's get( ) function, there's no way to set headers to be sent with the GET request or get more information about the response, such as the status code. These are important things, because some web servers have copies of documents in different languages and use the HTTP Accept-Language header to determine which document to return. Likewise, the HTTP response code can let us distinguish between permanent failures (e.g., "404 Not Found") and temporary failures ("503 Service [Temporarily] Unavailable").
Even the most common type of nontrivial web robot (a link checker) benefits from access to response codes. A 403 ("Forbidden," usually because of file permissions) could be automatically corrected, whereas a 404 ("Not Found") error implies an out-of-date link that requires fixing. But if you want access to these codes or other parts of the response besides just the main content, your task is no longer a simple one, and so you shouldn't use LWP::Simple for it. The "simple" in LWP::Simple refers not just to the style of its interface, but also to the kind of tasks for which it's meant.
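As a rough illustration of how a link checker might act on those codes, consider the sketch below. The function name triage_status and the actions it returns are hypothetical; a real checker would need a fuller policy than this handful of cases.

```perl
use strict;
use warnings;

# Map an HTTP status code to what a link checker might do about it.
# The policy here is an assumption for illustration only.
sub triage_status {
    my ($code) = @_;
    return 'ok'                if $code >= 200 && $code < 300;
    return 'check permissions' if $code == 403;   # Forbidden
    return 'fix the link'      if $code == 404;   # Not Found
    return 'retry later'       if $code == 503;   # temporarily unavailable
    return 'investigate';
}

for my $code (200, 403, 404, 503) {
    print "$code: ", triage_status($code), "\n";
}
```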
One way to get the status code is to use LWP::Simple's getstore( ) function, which writes the document to a file and returns the status code from the response:
$status = getstore("http://www.suck.com/daily/2001/01/05/1.html",
                   "/tmp/web.html");
Additional content appearing in this section has been removed.
Fetching Documents Without LWP::Simple
LWP::Simple is convenient but not all-powerful. In particular, with it we can't make POST requests, set request headers, or query response headers. To do these things, we need to go beyond LWP::Simple.
The general all-purpose way to do HTTP GET queries is by using the do_GET( ) subroutine shown in Example 2-5.
Example 2-5. The do_GET subroutine
use LWP;
my $browser;
sub do_GET {
  # Parameters: the URL,
  #  and then, optionally, any header lines: (key,value, key,value)
  $browser = LWP::UserAgent->new(  ) unless $browser;
  my $resp = $browser->get(@_);
  return ($resp->content, $resp->status_line, $resp->is_success, $resp)
    if wantarray;
  return unless $resp->is_success;
  return $resp->content;
}
A full explanation of the internals of do_GET( ) is given in Chapter 3. Until then, we'll be using it without fully understanding how it works.
You can call the do_GET( ) function in either scalar or list context:
doc = do_GET(URL [header, value, ...]);
(doc, status, successful, response) = do_GET(URL [header, value, ...]);
In scalar context, it returns the document or undef if there is an error. In list context, it returns the document (if any), the status line from the HTTP response, a Boolean value indicating whether the status code indicates a successful response, and an object we can interrogate to find out more about the response.
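The context-sensitive return is the work of Perl's wantarray function. To show the behavior without a network connection, the sketch below uses a stand-in called fake_GET, a hypothetical function that returns canned data for a placeholder URL. Unlike do_GET( ), it returns only three values in list context and no response object.

```perl
use strict;
use warnings;

# A stand-in for do_GET( ) that returns canned data instead of using the network.
sub fake_GET {
    my ($url) = @_;
    my ($content, $status, $ok) = $url eq 'http://www.example.com/'
        ? ('<html>hello</html>', '200 OK', 1)
        : ('', '404 Not Found', 0);
    return ($content, $status, $ok) if wantarray;   # list context
    return unless $ok;                              # scalar context, failure
    return $content;                                # scalar context, success
}

my $doc = fake_GET('http://www.example.com/');                          # scalar
my ($body, $status, $ok) = fake_GET('http://www.example.com/missing'); # list
print "scalar: $doc\n";
print "list:   status=$status ok=$ok\n";
```

A caller that only wants the document can use the scalar form and test for undef; a caller that needs to know why a request failed uses the list form.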
Recall that assigning to undef discards that value. For example, this is how you fetch a document into a string and learn whether it is successful:
($doc, undef, $successful, undef) = do_GET('http://www.suck.com/');
The optional header and value arguments to do_GET( ) let you add headers to the request. For example, to attempt to fetch the German language version of the European Union home page:
$body = do_GET("http://europa.eu.int/",
  "Accept-language" => "de",
);
The do_GET( ) function that we'll use in this chapter provides the same basic convenience as LWP::Simple's get( ) but without the limitations.
Additional content appearing in this section has been removed.
Example: AltaVista
Every so often, two people, somewhere, somehow, will come to argue over a point of English spelling—one of them will hold up a dictionary recommending one spelling, and the other will hold up a dictionary recommending something else. In olden times, such conflicts were tidily settled with a fight to the death, but in these days of overspecialization, it is common for one of the spelling combatants to say "Let's ask a linguist. He'll know I'm right and you're wrong!" And so I am contacted, and my supposedly expert opinion is requested. And if I happen to be answering mail that month, my response is often something like:
Dear Mr. Hing:
I have read with intense interest your letter detailing your struggle with the question of whether your favorite savory spice should be spelled in English as "asafoetida" or whether you should heed your secretary's admonishment that all the kids today are spelling it "asafetida."
I could note various factors potentially involved here; notably, the fact that in many cases, British/Commonwealth spelling retains many "ae"/"oe" digraphs whereas U.S./Canadian spelling strongly prefers an "e" ("foetus"/"fetus," etc.). But I will instead be (merely) democratic about this and note that if you use AltaVista (http://altavista.com, a well-known search engine) to run a search on "asafetida," it will say that across all the pages that AltaVista has indexed, there are "about 4,170" matched; whereas for "asafoetida" there are many more, "about 8,720."
So you, with the "oe," are apparently in the majority.
To automate the task of producing such reports, I've written a small program called alta_count, which queries AltaVista for each term given and reports the count of documents matched:
% alta_count asafetida asafoetida
asafetida: 4,170 matches
asafoetida: 8,720 matches
At the time of this writing, going to http://altavista.com, putting a word or phrase in the search box, and hitting the Submit button yields a result page with a URL that looks like this:
http://www.altavista.com/sites/search/web?q=%22asafetida%22&kl=XX
Additional content appearing in this section has been removed.
HTTP POST
Some forms use GET to submit their parameters to the server, but many use POST. The difference is that POST requests pass the parameters in the body of the request, whereas GET requests encode the parameters into the URL being requested.
Babelfish (http://babelfish.altavista.com) is a service that lets you translate text from one human language into another. If you're accessing Babelfish from a browser, you see an HTML form where you paste in the text you want translated, specify the language you want it translated from and to, and hit Translate. After a few seconds, a new page appears, with your translation.
Behind the scenes, the browser takes the key/value pairs in the form:
urltext = I like pie
lp = en_fr
enc = utf8
and rolls them into an HTTP request:
POST /translate.dyn HTTP/1.1
Host: babelfish.altavista.com
User-Agent: SuperDuperBrowser/14.6
Content-Type: application/x-www-form-urlencoded
Content-Length: 40
  
urltext=I%20like%20pie&lp=en_fr&enc=utf8
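The body and its Content-Length can be reproduced with a few lines of Perl. This is a sketch under stated assumptions: the names url_encode and form_body are ours, the input is taken to be plain ASCII, and a space is written as %20 to match the request above (real form encoders commonly send a plus sign for a space instead).

```perl
use strict;
use warnings;

# Percent-encode everything except letters, digits, and - _ . ! ~ * ' ( )
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Turn [key, value, key, value, ...] into a form-encoded request body.
sub form_body {
    my ($pairs) = @_;
    my @copy = @$pairs;
    my @parts;
    while (my ($key, $value) = splice @copy, 0, 2) {
        push @parts, url_encode($key) . '=' . url_encode($value);
    }
    return join '&', @parts;
}

my $body = form_body([ urltext => 'I like pie', lp => 'en_fr', enc => 'utf8' ]);
print "Content-Length: ", length($body), "\n\n$body\n";
```

The length printed is 40, which is where the Content-Length header in the request comes from: it counts the bytes of the encoded body, not of the original text.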
Just as we used a do_GET( ) function to automate a GET query, Example 2-7 uses a do_POST( ) function to automate POST queries.
Example 2-7. The do_POST subroutine
use LWP;
my $browser;
sub do_POST {
  # Parameters:
  #  the URL,
  #  an arrayref or hashref for the key/value pairs,
  #  and then, optionally, any header lines: (key,value, key,value)
  $browser = LWP::UserAgent->new(  ) unless $browser;
  my $resp = $browser->post(@_);
  return ($resp->content, $resp->status_line, $resp->is_success, $resp)
    if wantarray;
  return unless $resp->is_success;
  return $resp->content;
}
Use do_POST( ) like this:
doc = do_POST(URL, form_ref [, header, value, ...]);
(doc, status, success, resp) = do_POST(URL, form_ref [, header, value, ...]);
The return values in scalar and list context are as for do_GET( ). The form_ref parameter is a reference to an array or a hash containing the form parameters. Any headers you want sent in the request follow it as key/value pairs.
Additional content appearing in this section has been removed.
Example: Babelfish
Submitting a POST query to Babelfish is as simple as:
my ($content, $message, $is_success) = do_POST(
  'http://babelfish.altavista.com/translate.dyn',
  [ 'urltext' => "I like pie", 'lp' => "en_fr", 'enc' => 'utf8' ],
);
If the request succeeded ($is_success will tell us this), $content will be an HTML page that contains the translation text. At the time of this writing, the translation is inside the only textarea element on the page, so it can be extracted with just this regexp:
$content =~ m{<textarea.*?>(.*?)</textarea>}is;
The translated text is now in $1, if the match succeeded.
Knowing this, it's easy to wrap this whole procedure up in a function that takes the text to translate and a specification of what language from and to, and returns the translation. Example 2-8 is such a function.
Example 2-8. Using Babelfish to translate
sub translate {
  my ($text, $language_path) = @_;

  my ($content, $message, $is_success) = do_POST(
    'http://babelfish.altavista.com/translate.dyn',
    [ 'urltext' => $text, 'lp' => $language_path, 'enc' => 'utf8' ],
  );
  die "Error in translation $language_path: $message\n"
   unless $is_success;

  if ($content =~ m{<textarea.*?>(.*?)</textarea>}is) {
    my $translation = $1;
    # Collapse runs of whitespace, then trim both ends:
    $translation =~ s/\s+/ /g;
    $translation =~ s/^ //s;
    $translation =~ s/ $//s;
    return $translation;
  } else {
    die "Can't find translation in response to $language_path";
  }
}
The translate( ) subroutine constructs the request and extracts the translation from the response, cleaning up any whitespace that may surround it. If the request couldn't be completed, the subroutine throws an exception by calling die( ).
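The extraction step doesn't depend on the network, so it can be tried on its own. Here is a minimal sketch that runs the same regexp and whitespace cleanup against a canned string. The HTML in $content is made up for illustration (it is not real Babelfish output), and extract_translation( ) is a hypothetical helper name, not something from LWP:

```perl
use strict;
use warnings;

# Made-up stand-in for a response page; not real Babelfish output.
my $content = <<'END_HTML';
<html><body>
<form><textarea name="q" rows="3">
   J'aime
   la tarte
</textarea></form>
</body></html>
END_HTML

# Hypothetical helper: the same match and cleanup as translate( ),
# minus the HTTP request.
sub extract_translation {
  my ($html) = @_;
  return unless $html =~ m{<textarea.*?>(.*?)</textarea>}is;
  my $translation = $1;
  $translation =~ s/\s+/ /g;   # collapse runs of whitespace, newlines included
  $translation =~ s/^ //s;     # drop the leading space, if any
  $translation =~ s/ $//s;     # drop the trailing space, if any
  return $translation;
}

print extract_translation($content), "\n";
```

Keeping the parsing separate from the request like this also makes it easier to adjust the regexp if the page layout changes.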
The translate( ) subroutine could be used to automate on-demand translation of important content from one language to another. But machine translation is still a fairly new technology, and the real value of it is to be found in translating from English into another language and then back into English, just for fun. (Incidentally, there's a CPAN module that takes care of all these details for you, called Lingua::Translate, but here we're interested in how to carry out the task, rather than whether someone's already figured it out and posted it to CPAN.)
Chapter 3: The LWP Class Model
For full access to every part of an HTTP transaction—request headers and body, response status line, headers and body—you have to go beyond LWP::Simple, to the object-oriented modules that form the heart of the LWP suite. This chapter introduces the classes that LWP uses to represent browser objects (which you use for making requests) and response objects (which are the result of making a request). You'll learn the basic mechanics of customizing requests and inspecting responses, which we'll use in later chapters for cookies, language selection, spidering, and more.
The Basic Classes
In LWP's object model, you perform GET, HEAD, and POST requests via a browser object (a.k.a. a user agent object) of class LWP::UserAgent, and the result is an HTTP response of the aptly named class HTTP::Response. These are the two main classes, with other incidental classes providing features such as cookie management and user agents that act as spiders. Still more classes deal with non-HTTP aspects of the Web, such as HTML. In this chapter, we'll deal with the classes needed to perform web requests.
The classes can be loaded individually:
use LWP::UserAgent;
use HTTP::Response;
But it's easiest to simply use the LWP convenience class, which loads LWP::UserAgent and HTTP::Response for you:
use LWP;               # same as previous two lines
If you're familiar with object-oriented programming in Perl, the LWP classes will hold few real surprises for you. All you need is to learn the names of the basic classes and accessors. If you're not familiar with object-oriented programming in any language, you have some catching up to do. Appendix G will give you a bit of conceptual background on the object-oriented approach to things. To learn more (including information on how to write your own classes), check out Programming Perl (O'Reilly).
Programming with LWP Classes
The first step in writing a program that uses the LWP classes is to create and initialize the browser object, which can be used throughout the rest of the program. You need a browser object to perform HTTP requests, and although you could use several browser objects per program, I've never run into a reason to use more than one.
The browser object can use a proxy (a server that fetches web pages for you, such as a firewall, or a web cache such as Squid). It's good form to check the environment for proxy settings by calling env_proxy():
use LWP::UserAgent;
my $browser = LWP::UserAgent->new(  );
$browser->env_proxy(  ); # if we're behind a firewall
That's all the initialization that most user agents will ever need. Once you've done that, you usually won't do anything with it for the rest of the program, aside from calling its get( ), head( ), or post( ) methods, to get what's at a URL, or to perform HTTP HEAD or POST requests on it. For example:
my $url = 'http://www.guardian.co.uk/';
my $response = $browser->get($url);
Then you call methods on the response to check the status, extract the content, and so on. For example, this code checks to make sure we successfully fetched an HTML document that isn't worryingly short, then prints a message depending on whether the words "Madonna" or "Arkansas" appear in the content:
die "Hmm, error \"", $response->status_line(  ),
  "\" when getting $url"  unless $response->is_success(  );
my $content_type = $response->content_type(  );
die "Hm, unexpected content type $content_type from $url"
   unless $content_type eq 'text/html';
my $content = $response->content(  );
die "Odd, the content from $url is awfully short!"
   if length($content) < 3000;
if($content =~ m/Madonna|Arkansas/i) {
   print "<!-- The news today is IMPORTANT -->\n",
         $content;
} else {
   print "$url has no news of ANY CONCEIVABLE IMPORTANCE!\n";
}
As you see, the response object contains all the data from the web server's response (or an error message about how that server wasn't reachable!), and we use method calls to get at the data. There are accessors for the different parts of the response (e.g., the status line) and convenience functions to tell us whether the response was successful (is_success( )).
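To experiment with these accessors without touching the network, you can build a response by hand. This is only a sketch for offline testing: the header value and the padded HTML body are invented, and a real program would get the object back from $browser->get($url) instead. It also shows that content_type( ), in scalar context, returns just the media type, which is why comparing it against 'text/html' works even when the server adds a charset parameter:

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response for offline experimentation; the body is padded
# so that it passes a "not worryingly short" length check.
my $html = "<html><body>" . ("Arkansas news. " x 250) . "</body></html>";
my $response = HTTP::Response->new(
  200, 'OK',
  [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
  $html,
);

# scalar( ) forces scalar context, so we get only the media type.
my $content_type = scalar $response->content_type;

print "Status:       ", $response->status_line, "\n";
print "Content type: ", $content_type, "\n";
print "Long enough:  ", (length($response->content) >= 3000 ? "yes" : "no"), "\n";
print "The news today is IMPORTANT\n"
  if $response->content =~ m/Madonna|Arkansas/i;
```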
Inside the do_GET and do_POST Functions
You now know enough to follow the do_GET( ) and do_POST( ) functions introduced in Chapter 2. Let's look at do_GET( ) first.
Start by loading the module, then declare the $browser variable that will hold the user agent. It's declared outside the scope of the do_GET( ) subroutine, so it's essentially a static variable, retaining its value between calls to the subroutine. For example, if you turn on support for HTTP cookies, this browser could persist between calls to do_GET( ), and cookies set by the server in one call would be sent back in a subsequent call.
use LWP;
my $browser;
sub do_GET {
Next, create the user agent if it doesn't already exist:
$browser = LWP::UserAgent->new(  ) unless $browser;
Enable proxying, if you're behind a firewall:
$browser->env_proxy();
Then perform a GET request based on the subroutine's parameters:
my $response = $browser->get(@_);
In list context, you return information provided by the response object: the content, status line, a Boolean indicating whether the status meant success, and the response object itself:
return($response->content, $response->status_line, $response->is_success, $response)
  if wantarray;
If there was a problem and the function was called in scalar context, return undef:
return unless $response->is_success;
Otherwise, return the content:
  return $response->content;
}
The do_POST( ) subroutine is just like do_GET( ), only it uses the post( ) method instead of get( ).
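The two ideas that carry these functions, a file-scoped variable that persists between calls and a return value that depends on the calling context, can be seen in isolation with a small sketch. Note that fake_fetch( ) and $calls are hypothetical names invented for this illustration; the function makes no request and only imitates the shape of do_GET( ):

```perl
use strict;
use warnings;

my $calls = 0;   # declared outside the sub, so it persists like $browser

# Hypothetical stand-in for do_GET( ); it performs no network request.
sub fake_fetch {
  my ($ok, $body) = @_;
  $calls++;
  my $status = $ok ? '200 OK' : '404 Not Found';
  return ($body, $status, $ok) if wantarray;  # list context: all the details
  return unless $ok;                          # scalar context, failure: undef
  return $body;                               # scalar context, success
}

my $doc = fake_fetch(1, 'hello');                        # scalar context
my ($content, $status, $success) = fake_fetch(0, '');    # list context
my $failed = fake_fetch(0, 'ignored');                   # scalar context

print "scalar success: $doc\n";
print "list failure:   $status\n";
print "scalar failure: ", (defined $failed ? 'defined' : 'undef'), "\n";
print "calls so far:   $calls\n";
```

A bare return gives undef in scalar context and an empty list in list context, which is why the list-context case is handled first.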
The rest of this chapter is a detailed reference to the two classes we've covered so far: LWP::UserAgent and HTTP::Response.
User Agents
The first and simpler of LWP's two basic classes is LWP::UserAgent, which manages HTTP connections and performs requests for you. The new( ) constructor makes a user agent object:
$browser = LWP::UserAgent->new(%options);
The options and their default values are summarized in Table 3-1. The options are attributes whose values can be fetched or altered by the method calls described in the next section.
Table 3-1: Constructor options and default values for LWP::UserAgent

Key                     Default
agent                   "libwww-perl/#.###"
conn_cache              undef
cookie_jar              undef
from                    undef
max_size                undef
parse_head              1
protocols_allowed       undef
protocols_forbidden     undef
requests_redirectable   ['GET', 'HEAD']
timeout                 180
If you have a user agent object and want a copy of it (for example, you want to run the same requests over two connections, one persistent with KeepAlive and one without), use the clone( ) method:
$copy = $browser->clone(  );
This object represents a browser and has attributes you can get and set by calling methods on the object. Attributes modify future connections (e.g., proxying, timeouts, and whether the HTTP connection can be persistent) or the requests sent over the connection (e.g., authentication and cookies, or HTTP headers).
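As a rough illustration of constructor options, attribute accessors, and clone( ) working together, here is a short sketch. The agent string 'ExampleBot/0.1' and the timeout values are arbitrary choices made for this example; nothing here contacts a server:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# 'ExampleBot/0.1' and both timeouts are arbitrary, for illustration only.
my $browser = LWP::UserAgent->new(
  agent   => 'ExampleBot/0.1',
  timeout => 30,
);

# The copy starts with the same attribute values...
my $copy = $browser->clone;
# ...and changing an attribute on the copy leaves the original alone.
$copy->timeout(10);

print "Original: ", $browser->agent, ", timeout ", $browser->timeout, "\n";
print "Copy:     ", $copy->agent,    ", timeout ", $copy->timeout,    "\n";
```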
HTTP::Response Objects
You have to manually create most objects your programs work with by calling an explicit constructor, with the syntax ClassName->new( ). HTTP::Response objects are a notable exception. You never need to call HTTP::Response->new( ) to make them; instead, you just get them back as the result of a request made with one of the request methods (get( ), post( ), and head( )).
That is, when writing web clients, you never need to create a response yourself. Instead, a user agent creates it for you, to encapsulate the results of a request it made. You do, however, interrogate a response object's attributes. For example, the code( ) method returns the HTTP status code:
print "HTTP status: ", $response->code(  ), "\n";
HTTP status: 404
         
HTTP::Response objects also have convenience methods. For example, is_success( ) returns a true value if the response had a successful HTTP status code, or false if it didn't (e.g., 404, 403, 500, etc.). Always check your responses, like so:
die "Couldn't get the document"
  unless $response->is_success(  );
You might prefer something a bit more verbose, like this:
# Given $response and $url ...
die "Error getting $url\n", $response->status_line
  unless $response->is_success(  );
The status_line( ) method returns the entire HTTP status line:
$sl = $response->status_line(  );
This includes both the numeric code and the explanation. For example:
$response = $browser->get("http://www.cpan.org/nonesuch");
print $response->status_line(  );
404 Not Found
            
To get only the status code, use the code( ) method:
$code = $response->code(  );
To access only the explanatory message, use the message( ) method:
$msg = $response->message(  );
For example:
$response = $browser->get("http://www.cpan.org/nonesuch");
print $response->code(), " (that means ", $response->message(  ), ")\n";
404 (that means Not Found)
            
Four methods test for types of status codes in the response: is_error( ), is_success( ), is_redirect( ), and is_info( ). They return true if the status code corresponds to an error, a successful fetch, a redirection, or an informational status (e.g., "102 Processing"), respectively.
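To see how the four tests sort different status codes, you can try them on hand-built responses. This is a sketch for experimentation only: web clients normally never construct responses themselves, the sample codes are arbitrary, and kind_of( ) is a hypothetical helper invented here:

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: name the class of status a response falls into.
sub kind_of {
  my ($response) = @_;
  return $response->is_info     ? 'informational'
       : $response->is_success  ? 'success'
       : $response->is_redirect ? 'redirection'
       : $response->is_error    ? 'error'
       :                          'unknown';
}

# Hand-built responses, so no request is made.
for my $pair ([102, 'Processing'], [200, 'OK'],
              [302, 'Found'],      [404, 'Not Found']) {
  my $response = HTTP::Response->new(@$pair);
  printf "%-14s => %s\n", $response->status_line, kind_of($response);
}
```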
LWP Classes: Behind the Scenes
To get data off the Web with LWP, you really only need to know about LWP::UserAgent objects and HTTP::Response objects (although a rudimentary knowledge of the URI class and the HTTP::Cookies class can help too). But behind the scenes, there are dozens and dozens of classes that you generally don't need to know about, but that are still busily doing their work. Most of them are documented in the LWP manual pages, and you may see them mentioned in the documentation for the modules about which you do need to know. For completeness, they are listed in Appendix A.
Chapter 4: URLs
Now that you've seen how LWP models HTTP requests and responses, let's study the facilities it provides for working with URLs. A URL tells you how to get to something: "use HTTP with this host and request this," "connect via FTP to this host and retrieve this file," or "send email to this address."
The great variety inherent in URLs is both a blessing and a curse. On one hand, you can stretch the URL syntax to address almost any type of network resource. However, this very flexibility means attempts to parse arbitrary URLs with regular expressions rapidly run into a quagmire of special cases.
The LWP suite of modules provides the URI class to manage URLs. This chapter describes how to create objects that represent URLs, extract information from those objects, and convert between absolute and relative URLs. This last task is particularly useful for link checkers and spiders, which take partial URLs from HTML links and turn those into absolute URLs to request.
Parsing URLs
Rather than attempt to pull apart URLs with regular expressions, which is difficult to do in a way that works with all the many types of URLs, you should use the URI class. When you create an object representing a URL, it has attributes for each part of a URL (scheme, username, hostname, port, etc.). Make method calls to get and set these attributes.
Example 4-1 creates a URI object representing a complex URL, then calls methods to discover the various components of the URL.
Example 4-1. Decomposing a URL
use URI;
my $url = URI->new('http://user:pass@example.int:4345/hello.php?user=12');
print "Scheme: ", $url->scheme(  ), "\n";
print "Userinfo: ", $url->userinfo(  ), "\n";
print "Hostname: ", $url->host(  ), "\n";
print "Port: ", $url->port(  ), "\n";
print "Path: ", $url->path(  ), "\n";
print "Query: ", $url->query(  ), "\n";
Example 4-1 prints:
            Scheme: http
            Userinfo: user:pass
            Hostname: example.int
            Port: 4345
            Path: /hello.php
            Query: user=12
         
Besides reading the parts of a URL, methods such as host( ) can also alter the parts of a URL, using the familiar convention that $object->method reads an attribute's value and $object->method( newvalue ) alters an attribute:
use URI;
my $uri = URI->new("http://www.perl.com/I/like/pie.html");
$uri->host('testing.perl.com');
print $uri,"\n";
http://testing.perl.com/I/like/pie.html
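The same get/set convention applies to the other components. As a small sketch (the port number and query string are arbitrary values chosen for illustration), this changes two parts of a URL and shows that the rest is left untouched:

```perl
use strict;
use warnings;
use URI;

my $uri = URI->new("http://www.perl.com/I/like/pie.html");

$uri->port(8080);                # arbitrary port, for illustration
$uri->query('flavor=apple');     # hypothetical query string

print $uri, "\n";
print "Host is still ", $uri->host, "\n";
print "Path is still ", $uri->path, "\n";
```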
         
Now let's look at the methods in more depth.
An object of the URI class represents a URL. (Actually, a URI object can also represent a kind of URL-like string called a URN, but you're unlikely to run into one of those any time soon.) To create a URI object from a string containing a URL, use the new( ) constructor:
$url = URI->new(url [, scheme ]);
If url is a relative URL (a fragment such as staff/alicia.html), scheme determines the scheme you plan for this URL to have (http, for example).
Relative URLs
URL paths are either absolute or relative. An absolute URL starts with a scheme, then has whatever data this scheme requires. For an HTTP URL, this means a hostname and a path:
http://phee.phye.phoe.fm/thingamajig/stuff.html
Any URL that doesn't start with a scheme is relative. To interpret a relative URL, you need a base URL that is absolute (just as you don't know the GPS coordinates of "800 miles west of here" unless you know the GPS coordinates of "here").
A relative URL leaves some information implicit, which you look to its base URL for. For example, if your base URL is http://phee.phye.phoe.fm/thingamajig/stuff.html, and you see a relative URL of /also.html, then the implicit information is "with the same scheme (http)" and "on the same host (phee.phye.phoe.fm)," and the explicit information is "with the path /also.html." So this is equivalent to an absolute URL of:
http://phee.phye.phoe.fm/also.html
Some kinds of relative URLs require information from the path of the base URL in a way that closely mirrors relative filespecs in Unix filesystems, where ".." means "up one level", "." means "in this level", and anything else means "in this directory". So a relative URL of just zing.xml interpreted relative to http://phee.phye.phoe.fm/thingamajig/stuff.html yields this absolute URL:
http://phee.phye.phoe.fm/thingamajig/zing.xml
That is, we use all but the last bit of the absolute URL's path, then append the new component.
Similarly, a relative URL of ../hi_there.jpg interpreted against the absolute URL http://phee.phye.phoe.fm/thingamajig/stuff.html gives us this URL:
http://phee.phye.phoe.fm/hi_there.jpg
In figuring this out, start with http://phee.phye.phoe.fm/thingamajig/ and the ".." tells us to go up one level, giving us http://phee.phye.phoe.fm/. Append hi_there.jpg giving us the URL you see above.
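The three resolutions worked out by hand above can be checked with the URI class, whose new_abs( ) constructor takes a relative URL and a base URL and returns the absolute form. This is a minimal sketch using the same example URLs as the text:

```perl
use strict;
use warnings;
use URI;

my $base = 'http://phee.phye.phoe.fm/thingamajig/stuff.html';

# Resolve each relative URL from the text against the same base.
for my $relative ('/also.html', 'zing.xml', '../hi_there.jpg') {
  my $absolute = URI->new_abs($relative, $base);
  printf "%-16s => %s\n", $relative, $absolute;
}
```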
There's a third kind of relative URL, which consists entirely of a fragment, such as #endnotes. This is commonly met with in HTML docu