Web Client Programming with Perl: Chapter 6: Example LWP Programs

Web Client Programming with Perl

Automating Tasks on the Web

By Clinton Wong
1st Edition March 1997

This book is out of print, but it has been made available online through the O'Reilly Open Books Project.

Chapter 6.
Example LWP Programs

In this chapter:
Simple Clients
Periodic Clients
Recursive Clients

This chapter presents LWP programs that are more robust and feature-rich than the examples shown in previous chapters. While Chapter 5, The LWP Library, focused on teaching LWP and explained how LWP objects fit together, this chapter shows you some sample LWP programs with more user-friendly options and features.

We present three broad categories of web client programs:

Simple clients--programs that perform actions for users in real time, usually with a finite list of URLs to act upon. In this section, we present LWP versions of the hcat and hgrepurl programs that were presented in Chapter 4, The Socket Library.

Periodic clients--robots that perform a request repeatedly, with some delay between each request. Periodic clients typically request the same resource over and over, or a different resource in a predictable manner. For example, a client may request 0100.gif at 1 a.m., 0200.gif at 2 a.m, etc. A periodic client might check some data and perform action when a condition is met. In this section, we present a program that periodically checks the status of a Federal Express document.

Recursive clients--robots that follow hyperlinks or other references on an HTML page. In this section, we present a program that looks for bad links in a web site.

The boundaries between these categories are not set in stone. It is possible to write a periodic client that also happens to be a recursive client. Or a simple client might become periodic if the document indicates that the page should be refreshed every 15 minutes. We're not trying to classify all programs into one category or another; these categories are given as a way to identify distinct behaviors that a client may exhibit.

The examples in this chapter all use a simple command-line interface. In Chapter 7, Graphical Examples with Perl/Tk, we have some additional examples with a graphical interface using the Tk extension to Perl.

Simple Clients

Simple clients are programs that perform actions for users in real time, usually with a finite list of URLs to act upon. In this section, we'll show LWP versions of the socket-based hcat and hgrepurl programs that were presented in Chapter 4.

Hypertext UNIX cat Revisited

As you might recall, the sockets version of hcat used the open_TCP( ) function to establish a connection to a web server, and then issued an HTTP request, like "GET / HTTP/1.0". In LWP, many of the details are hidden from the programmer. Instead of this:

open_TCP(F, $the_url[1], $the_url[2])
print F "GET $the_url[3] HTTP/1.0\n";
print F "Accept: */*\n";
print F "User-Agent: hcat/1.0\n\n";

in LWP, it can be written like this:

my $ua = new LWP::UserAgent;
$ua->agent("hcat/1.0");
my $request = new HTTP::Request("GET", $path);
my $response = $ua->request($request);

They both do the same thing; they request a document from a user-specified web server and identify themselves in the User-Agent header. But one looks a lot cleaner than the other. Instead of using the nitty-gritty socket code that talks directly to the web server, you merely describe to LWP what the action should be. LWP handles it for you. Many things, like handling URL redirection or handling HTTP version differences, will be handled automatically by LWP.

Also, the following lines in the sockets version of hcat can be replaced:

# print out server's response.
 
  # get the HTTP response line
  $the_response=<F>;
  print $the_response if ($all || $response);
 
  # get the header data
  while(<F>=~ m/^(\S+):\s+(.+)/) {
    print "$1: $2\n" if ($all || $header);
  }
 
  # get the entity body
  if ($all || $data) {
    print while (<F>);
  }

In LWP, these lines can be written as:

my $code=$response->code;
my $desc = HTTP::Status::status_message($code);
my $headers=$response->headers_as_string;
my $body =  $response->content;
 
if ($opt_r || $all) { print "HTTP/1.0 $code $desc\n"; }
if ($opt_H || $all) { print "$headers\n";             }
if ($opt_d || $all) { print $body;                    }

In addition, we've added proxy support, since it's trivial in LWP:

my $ua = new LWP::UserAgent;
$ua->agent("hcat/1.0");
 
# If proxy server specified, define it in the User Agent object
  if (defined $proxy) {
    my $url = new URI::URL $path;
    my $scheme = $url->scheme;
    $ua->proxy($scheme, $proxy);
  }

The source in its entirety looks like this:

#!/usr/local/bin/perl -w
 
use strict;
use HTTP::Status;
use HTTP::Response;
use LWP::UserAgent;
use URI::URL;
use vars qw($opt_h $opt_r $opt_H $opt_d $opt_p);
use Getopt::Std;
 
my $url;
my $goterr;

After calling all the necessary Perl modules and declaring variables, we process command-line arguments:

getopts('hrHdp:');
my $all = !($opt_r || $opt_H || $opt_d);    # all=1 when -r -H -d not set
 
if ($opt_h || $#ARGV==-1) {  # print help text when -h or no args
  print_help( );
  exit(0);
}

Then, for any string that remains as a command-like parameter, we treat it as a URL, process it, and print out the result:

my $goterr = 0;  # make sure we clear the error flag
 
while ($url = shift @ARGV) {
 
  my ($code, $desc, $headers, $body)=simple_get('GET', $url, $opt_p);
  if ($opt_r || $all) { print "HTTP/1.0 $code $desc\n"; }
  if ($opt_H || $all) { print "$headers\n";             }
  if ($opt_d || $all) { print $body;                    }
 
  $goterr |= HTTP::Status::is_error($code);
}
 
exit($goterr);

The print-help( ) routine just prints out a range line and a list of command-line options:

sub print_help {
  print <<"HELP";
usage: $0 [-hrmbp] [proxy URL] URLs
 
 -h help
 -r response line only
 -H HTTP header data only
 -d data from entity body
 -p use this proxy server
 
Example:  $0 -p http://proxy:8080/ http://www.ora.com
 
HELP
}

The actual processing is done in a separate function, called simple_get( ):

sub simple_get( ) {
 
  my ($method, $path, $proxy) = @_;
 
# Create a User Agent object
  my $ua = new LWP::UserAgent;
  $ua->agent("hcat/1.0");
 
# If proxy server specified, define it in the User Agent object
  if (defined $proxy) {
    my $url = new URI::URL $path;
    my $scheme = $url->scheme;
    $ua->proxy($scheme, $proxy);
  }
 
# Ask the User Agent object to request a URL.
# Results go into the response object (HTTP::Reponse).
 
  my $request = new HTTP::Request($method, $path);
  my $response = $ua->request($request);
 
# Parse/convert the response object for "easier reading"
 
  my $code=$response->code;
  my $desc = HTTP::Status::status_message($code);
  my $headers=$response->headers_as_string;
 
  my $body =  $response->content;
  $body =  $response->error_as_HTML if ($response->is_error);
 
  return ($code, $desc, $headers, $body);
}

Within simple_get( ), an LWP::UserAgent object is created, and a proxy server is defined for the object if one was specified to simple_get( ). A new HTTP::Request object is created with the HTTP method and path that are passed to simple_get( ). The request is given to UserAgent's request( ) method, and an HTTP::Response object is returned. From there, HTTP::Response::code( ), HTTP::Response::headers_as_string( ), and HTTP::Response::content( ) are used to extract the response information from the HTTP::Response object.

Hypertext Grep URLs Revisited

The code that does the HTTP request of hgrepurl looks very much like hcat 's. Instead of repeating that information, let's center on another chunk of code that changed from the sockets version of hgrepurl.

In Chapter 4, the raw sockets version checked the response code and then skipped over the HTTP headers:

# if not an "OK" response of 200, skip it
if ($the_response !~ m@^HTTP/\d+\.\d+\s+200\s@) {return;}
 
# get the header data
while(<F>=~ m/^(\S+):\s+(.+)/) {
  # skip over the headers
}

In LWP, this can more easily be said with something like this:

  if ($response->code!= RC_OK) { return; }
  if ($response->content_type !~ m@text/html@) { return; }

In the process of finding URLs without the help of LWP, one would have to do something like this:

$data =~ s/<([^>]*)>//;
$in_brackets=$1;
$key='a';
$tag='href';
if ($in_brackets =~ /^\s*$key\s+/i) {     # if tag matches, try parms
  if ($in_brackets =~ /\s+$tag\s*=\s*"([^"]*)"/i) {
    $link=$1;
    $link =~ s/[\n\r]//g;  # kill newlines,returns anywhere in url
    # process the URL here
  } 
}

But in LWP, this simplifies to something like this:

my $parsed_html=HTML::Parse::parse_html($data);
for (@{ $parsed_html->extract_links(qw (body img)) }) {
  my ($link) = @$_;
  # process the URL here
}

As you can see, LWP simplified a lot of the code. Let's go over hgrepurl in a little more detail:

#!/usr/local/bin/perl -w
 
use strict;
use HTTP::Status;
use HTTP::Response;
use LWP::UserAgent;
use URI::URL;
use HTML::Parse;
use vars qw($opt_h $opt_i $opt_l $opt_p);
use Getopt::Std;
 
my $url;

After calling all the necessary modules and declaring variables, there's the usual command-line processing with getopts( ):

getopts('hilp:');
my $all = !($opt_i || $opt_l);       # $all=1 when -i -l not set
 
if ($opt_h || $#ARGV==-1) {  # print help text when -h or no args
  print_help( );
  exit(0);
}

Any remaining command-line arguments are treated as URLs and passed to get_html( ):

while ($url = shift @ARGV) {
 
  my ($code, $type, $data)  = get_html($url, $opt_p, $opt_i, $opt_l);
  if (not_good($code, $type)) { next; }
  if ($opt_i || $all) { print_images($data, $url); }
  if ($opt_l || $all) { print_hyperlinks($data, $url); }
 
} # while there are URLs on the command line

As in hcat, print_help( ) displays a help message:

sub print_help {
  print << "HELP";
usage: $0 [-hilp] [proxy URL] URLs
 
 -h help
 -i grep out images references only
 -l grep out hyperlink references only
 -p use this proxy server
 
Example:  $0 -p http://proxy:8080/ http://www.ora.com
 
HELP
}

The get_html( ) routine is defined next. The response of get_html( ) is the response code, content type, and entity-body of the reply.

sub get_html( ) {
  my($url, $proxy, $want_image, $want_link) = @_;
 
# Create a User Agent object
  my $ua = new LWP::UserAgent;
  $ua->agent("hgrepurl/1.0");
 
# If proxy server specified, define it in the User Agent object
  if (defined $proxy) {
    my $proxy_url = new URI::URL $url;
    $ua->proxy($proxy_url->scheme, $proxy);
  }
 
# Ask the User Agent object to request a URL.
# Results go into the response object (HTTP::Reponse).
 
  my $request = new HTTP::Request('GET', $url);
  my $response = $ua->request($request);
 
  return ($response->code, $response->content_type,
      $response->content);
}

The not_good( ) routine tells us if the document that was returned was HTML, since the program doesn't really make sense otherwise:

# returns 1 if the request was not OK or HTML, else 0
 
sub not_good {
  my ($code, $type) = @_;
 
  if ($code != RC_OK) {
    warn("$url had response code of $code");
    return 1;
  }
 
  if ($type !~ m@text/html@) {
    warn("$url is not HTML.");
    return 1;
  }
  return 0;
}

The print-images( ) and print-hyperlinks( ) routines display any links found in the document:

sub print_images {
 
  my ($data, $model) = @_;
 
  my $parsed_html=HTML::Parse::parse_html($data);
  for (@{ $parsed_html->extract_links(qw (body img)) }) {
    my ($link) = @$_;
    my ($absolute_link) = globalize_url($link, $model);
    print "$absolute_link\n";
  }
  $parsed_html->delete( ); # manually do garbage collection
}
 
sub print_hyperlinks {
 
  my ($data, $model) = @_;
 
  my $parsed_html=HTML::Parse::parse_html($data);
  for (@{ $parsed_html->extract_links(qw (a)) }) {
    my ($link) = @$_;
    my ($absolute_link) = globalize_url($link, $model);
    print "$absolute_link\n";
  }
  $parsed_html->delete( ); # manually do garbage collection
}

Finally, the globalize_url() function returns the absolute URL version of a relative URL.

sub globalize_url( ) {
 
  my ($partial, $model) = @_;
  my $url = new URI::URL($partial, $model);
  my $globalized = $url->abs->as_string;
 
  return $globalized;
}

Periodic Clients

The Federal Express checker program, or FedEx, is very much like the previous program, hgrepurl. They're similar because they both look at the entity-body and attempt to find some useful information in it. While hgrepurl merely prints out any URLs that it finds, the FedEx program looks for a certain phrase followed by a colon (:) followed by more text. After shipping out a few documents and watching the HTML that corresponds to each, we've noticed the following pattern:

When a document is delivered, the text "Delivered To : " shows up in the entity-body and is followed by the name of the recipient of the document.

For some reason, when the document is delivered, the "Delivered To : " is sometimes blank, but the "Delivery Time : " field is filled in.

If the tracking information isn't ready, or if the requested information doesn't exist, there isn't a "Delivered To : " field at all. In this case, there's a descriptive error message between the  and  tags in the response.

If "Delivered To : " shows up in the reply (with or without text after the colon), the query was successful but the document is not at the destination yet.

Otherwise, the request resulted in an error.

Given all this, we wrote a FedEx package that connects to the Federal Express web site and does a query on a periodic basis. The package is implemented as a class, so you can easily transport this to another program, if you want. In Chapter 7, we'll show a graphical interface to this package.

For now, let's dissect the FedEx class. First, we have a constructor that accepts three parameters from the programmer: the URL of the CGI program to use, the email address of the person using the program, and an optional third parameter that specifies a proxy server to use. These settings are stored internally in the newly created FedEx object as a URI::URL object and LWP::RobotUA object:

package FedEx;
sub new {
 
  my($class, $cgi_url, $email, $proxy) = @_;
  my $user_agent_name = 'ORA-Check-FedEx/1.0';
 
  my $self  = {};
  bless $self, $class;
 
  $self->{'url'} = new URI::URL $cgi_url;
 
  $self->{'robot'} = new LWP::RobotUA $user_agent_name, $email; 
  $self->{'robot'}->delay(0);    # we'll delay requests by hand
 
  if ($proxy) {
    $self->{'robot'}->proxy('http', $proxy);
  }
 
  $self;
}

The check( ) method accepts a tracking number, country string, and date as parameters. From there, a properly encoded query string is added to the URI::URL object with a call to $self->{'url'}query( ). A new HTTP::Request( ) object is made with the URI::URL object as a parameter. The request is issued with the call to $self->{'robot'}->request( ) and a HTTP::Response object is returned:

sub check {
 
  my ($self, $track_num, $country, $date) = @_;
 
  $self->{'url'}->query("trk_num=$track_num&dest_cntry=" .
    "$country&ship_date=$date");
  my $request = new HTTP::Request 'GET', $self->{'url'};
 
  my $response = $self->{'robot'}->request($request);
  $self->{'status'} = $response->code( );

If the HTTP request was a success, then we can analyze the response. If the FedEx document was delivered, the "Delivered To : " field is filled out. When this happens, the FedEx package sets a few internal values to reflect this:

if ($response->content =~ /Delivered To : (\w.*)/) {
 
  # package delivered
  $self->{'who_got_it'} = $1;
  $self->{'delivered'} = 1;
}

As noted before, sometimes when the document is delivered, the "Delivered To : " field is blank, but the "Delivery Time : " field is set:

elsif ($response->content =~ /Delivery Time : \w.*/) {
 
  # package delivered
  $self->{'who_got_it'} = 'left blank by FedEx computer';
  $self->{'delivered'} = 1;
}

If "Delivered To : " shows up in the reply, the query was successful but the document didn't arrive. But if it didn't show up, there's something wrong with the request. A descriptive error message should show up between the  and  tags:

if ($response->content !~ /Delivered To : /) {
  $self->{'status'} = RC_BAD_REQUEST;
 
  # get explanation from HTML response
  my $START = '<!-- BEGIN TRACKING INFORMATION -->';
  my $END = '<!-- END TRACKING INFORMATION -->';
 
  if   ($response->content =~ /$START(.*?)$END/s) {
    $self->{'error_as_HTML'} = $1;
  }
  else {
    # couldn't get explanation, use generic one
    $self->{'error_as_HTML'} = 'Unexpected HTML response from FedEx'; 
} # couldn't get error explanation

And then there are cases when the HTTP response didn't result in a status code of 200 (OK):

$self->{'error_as_HTML'} = $response->error_as_HTML;

That just about wraps up the FedEx package.

For the sake of being a good Object-Oriented citizen, a public interface to the FedEx object's settings are available. The retrieve_okay( ) method returns true when the HTTP response code was 200. The delivered( ) method returns true if the document was delivered. The who_got_it( ) method returns the name of the recipient of a delivered package. Finally, the error_info( ) method prints out an HTML error message.

Now that we've reviewed the important parts of the FedEx package, let's take a look at the complete example. Note how one creates a FedEx object and calls it. We'll come back to this example and redo it as a graphical client in Chapter 7:

#!/usr/local/bin/perl -w 
use strict;
 
use HTML::FormatText;
use HTML::Parse;
use vars qw($opt_h $opt_a $opt_e $opt_d $opt_c $opt_p);
use Getopt::Std;
 
# URL that handles our FedEx query
my $cgi = 'http://www.fedex.com/cgi-bin/track_it';
 
getopts('ha:e:d:c:p:');
 
# print help upon request or when arguments are missing
if ($opt_h  || !($opt_a && $opt_e && $opt_d && $opt_c )) {
  print_help( );
  exit(0);
}
 
# 
my $tracker = new FedEx $cgi, $opt_e, $opt_ p;
 
my $keep_checking = 1;

First, we declare local variables, call all necessary modules, get command-line options, etc.

The body of the program is just a loop that keeps checking the FedEx site until the package is delivered or an error is found:

while ($keep_checking) {
  $tracker->check($opt_a, $opt_c, $opt_d);
 
  if ($tracker->retrieve_okay) {
 
    if ($tracker->delivered) {
      print "Tracking number $opt_a was delivered to: ",
             $tracker->who_got_it, "\n";
      $keep_checking = 0;
 
    } 
    else {
 
      # request was okay, but not delivered.  Let's wait
      sleep (60 * 30);  # sleep 30 minutes
    }
 
  } 
  else {
 
    # request not successful
    my $html_error_message = $tracker->error_info;
 
    my $parsed    = parse_html($html_error_message);
    my $converter = new HTML::FormatText;
    print $converter->format($parsed);
    
    $keep_checking = 0;
  }
}

The print_help( ) routine prints a help message, as always:

sub  print_help {
 
  print <<HELP
This program prints a notification when a FedEx shipment is delivered.
fedex -a 1234 -e user\@host.com -d 120396 -c U.S.A. [ -p http://host:port/ ] 
 
h - this help text
a - airbill number
e - your email address
d - date in MMDDYY format that document was sent
c - country of recipient
p - use this proxy server [optional]
HELP
}

Now the code we showed you previously, defining the FedEx package:

package FedEx;
 
use HTTP::Request;
use HTTP::Response;
use LWP::RobotUA;
use HTTP::Status;
 
sub new {
 
  my($class, $cgi_url, $email, $proxy) = @_;
  my $user_agent_name = 'ORA-Check-FedEx/1.0';
 
  my $self  = {};
  bless $self, $class;
 
  $self->{'url'} = new URI::URL $cgi_url;
 
  $self->{'robot'} = new LWP::RobotUA $user_agent_name, $email; 
  $self->{'robot'}->delay(0);    # we'll delay requests by hand
 
  if ($proxy) {
    $self->{'robot'}->proxy('http', $proxy);
  }
 
  $self;
}
 
sub check {
 
  my ($self, $track_num, $country, $date) = @_;
 
  $self->{'url'}->query("trk_num=$track_num&dest_cntry=" .
    "$country&ship_date=$date");
  my $request = new HTTP::Request 'GET', $self->{'url'};
 
  my $response = $self->{'robot'}->request($request);
  $self->{'status'} = $response->code( );
 
  if ($response->code == RC_OK) {
 
    if ($response->content =~ /Delivered To : (\w.*)/) {
 
      # package delivered
      $self->{'who_got_it'} = $1;
      $self->{'delivered'} = 1;
    } 
 
    # Odd cases when package is delivered but "Delivered To" is blank.
    # Check for delivery time instead.
 
    elsif ($response->content =~ /Delivery Time : \w.*/) {
 
      # package delivered
      $self->{'who_got_it'} = 'left blank by FedEx computer';
      $self->{'delivered'} = 1;
    }
    else {
 
      # package wasn't delivered
      $self->{'delivered'} = 0;
 
      # if there isn't a "Delivered To : " field, something's wrong.
      # error messages seen between HTML comments
 
      if ($response->content !~ /Delivered To : /) {
        $self->{'status'} = RC_BAD_REQUEST;
 
        # get explanation from HTML response
	my $START = '<!-- BEGIN TRACKING INFORMATION -->';
	my $END = '<!-- END TRACKING INFORMATION -->';
 
        if   ($response->content =~ /$START(.*?)$END/s) {
    
	  $self->{'error_as_HTML'} = $1;
        
        } 
        else {
          # couldn't get explanation, use generic one
          $self->{'error_as_HTML'} = 'Unexpected HTML response from
             FedEx'; 
 
        }    # couldn't get error explanation
      }      # unexpected reply
    }        # not delivered yet
  }          # if HTTP response of RC_OK (200)
  else {
    $self->{'error_as_HTML'} = $response->error_as_HTML;
  }
 
}
 
sub retrieve_okay {
 
  my $self = shift;
  if ($self->{'status'} != RC_OK) {return 0;}
  1;
}
 
sub delivered {
 
  my $self = shift;
  $self->{'delivered'};
}
 
sub who_got_it {
 
  my $self = shift;
  $self->{'who_got_it'};
}
 
sub error_info {
 
  my $self = shift;
  $self->{'error_as_HTML'};
}

Recursive Clients

Recursive clients are robots that follow hyperlinks or other references on an HTML page. In this section, we present a program that looks for bad links in a web site. I've created a package called CheckSite that follows links within HTML and reports various properties of each page. The constructor accepts the email address, delay time between requests, maximum number of requests, verbose flag, and optional proxy URL as parameters. As in the FedEx example, this creates an LWP::RobotUA object inside the CheckSite package.

package CheckSite;
sub new {
 
  my ($class, $email, $delay, $max, $verbose, $proxy) = @_;
  my $self = {};
  bless $self, $class;
 
 
  # Create a User Agent object, give it a name, set delay between requests
  $self->{'ua'} = new LWP::RobotUA 'ORA_checksite/1.0', $email;
  if (defined $delay) {$self->{'ua'}->delay($delay);}
 
  # If proxy server specified, define it in the User Agent object
  if (defined $proxy) {
    $self->{'ua'}->proxy('http', $proxy);
  }
 
  $self->{'max'} = $max;
  $self->{'verbose'} = $verbose;
 
  $self;
}

Then the scan( ) method does all the real work. The scan( ) method accepts a URL as a parameter. In a nutshell, here's what happens:

The scan( ) method pushes the first URL into a queue. For any URL pulled from the queue, any links on that page are extracted from that page and pushed on the queue. To keep track of which URLs have already been visited (and not to push them back onto the queue), we use an associative array called %touched and associate any URL that has been visited with a value of 1. There are other useful variables that are also used, to track which document points to what, the content-type of the document, which links are bad, which links are local, which links are remote, etc.

For a more detailed look at how this works, let's step through it.

First, the initial URL is pushed onto a queue:

push (@urls , $root_url);

The URL is then checked with a HEAD method. If we can determine that the URL is not an HTML document, we can skip it. Otherwise, we follow that with a GET method to get the HTML:

my $request  = new HTTP::Request('HEAD', $url);
my $response = $self->{'ua'}->request($request);
 
# if not HTML, don't bother to search it for URLs
next if ($response->header('Content-Type') !~ m@text/html@ );
 
    
# it is text/html, get the entity-body this time
$request->method('GET');
$response = $self->{'ua'}->request($request);

Then we extract the links from the HTML page. Here, we use our own function to extract the links. There is a similar function in the LWP library that extracts links, but we opted not to use it, since it is less prone to find links in slightly malformed HTML:

my @rel_urls = grab_urls($data);
 
foreach $verbose_link (@rel_urls) {
...
}

With each iteration of the foreach loop, we process one link. If we haven't seen it before, we add it to the queue:

foreach $verbose_link (@rel_urls) {
 
  if (! defined $self->{'touched'}{$full_child}) {
    push (@urls, $full_child);
  }
 
  # remember which url we just pushed, to avoid repushing
  $self->{'touched'}{$full_child} = 1;
}

While all of this is going on, we keep track of which documents don't exist, what their content types are, which ones are local to the web server, which are not local, and which are not HTTP-based. After scan( ) finishes, all of the information is available from CheckSite's public interface. The bad( ) method returns an associative array of any URLs that encountered errors. Within the associative array, one uses the URL as a key, and the key value is a \n delimited error message. For the not_web( ), local( ), and remote( ) methods, a similar associative array is returned, where the URL is a key in the array and denotes that the URL is not HTTP-based, is local to the web server, or is not local to the web server, in that order. The type( ) method returns an associate array of URLs, where the value of each URL hash contains the content-type for the URL. And finally, the ref( ) method is an associative array of URLs with values of referring URLs, delimited by \n. So if the URL hash of "www.ora.com" has a value of "a.ora.com" and "b.ora.com", that means "a.ora.com" and "b.ora.com" both point to "www.ora.com".

Here's the complete source of the CheckSite package, with some sample code around it to read in command-line arguments and print out the results:

#!/usr/local/bin/perl -w
use strict;
 
use vars qw($opt_a $opt_v $opt_l $opt_r $opt_R $opt_n $opt_b
            $opt_h $opt_m $opt_p $opt_e $opt_d);
use Getopt::Std;
 
 
# Important variables
#----------------------------
# @lookat     queue of URLs to look at
# %local      $local{$URL}=1  (local URLs in associative array)
# %remote     $remote{$URL}=1 (remote URLs in associative array)
# %ref        $ref{$URL}="URL\nURL\n" (list of URLs separated by \n)
# %touched    $touched{$URL}=1 (URLs that have been visited)
# %notweb     $notweb{$URL}=1 if URL is non-HTTP
# %badlist    $badlist{$URL}="reason" (URLs that failed. Separated with \n)
 
getopts('avlrRnbhm:p:e:d:');
 
# Display help upon -h, no args, or no e-mail address
 
if ($opt_h || $#ARGV == -1 || (! $opt_e) ) {
  print_help( );
  exit(-1);
}
 
# set maximum number of URLs to visit to be unlimited
 
my ($print_local, $print_remote, $print_ref, $print_not_web,
    $print_bad,   $verbose,      $max,       $proxy,
    $email,       $delay,        $url);
 
$max=0;
 
if ($opt_l) {$print_local=1;}
if ($opt_r) {$print_remote=1;}
if ($opt_R) {$print_ref=1;}
if ($opt_n) {$print_not_web=1;}
if ($opt_b) {$print_bad=1;}
if ($opt_v) {$verbose=1;}
if (defined $opt_m) {$max=$opt_m;}
if ($opt_ p) {$proxy=$opt_p;}
if ($opt_e) {$email=$opt_e;}
if (defined $opt_d) {$delay=$opt_d;}
if ($opt_a) {
  $print_local=$print_remote=$print_ref=$print_not_web=$print_bad = 1;
}
 
my $root_url=shift @ARGV;
 
# if there's no URL to start with, tell the user
unless ($root_url) {
  print "Error: need URL to start with\n";
  exit(-1);
}
 
# if no "output" options are selected, make "print_bad" the default
if (!($print_local || $print_remote || $print_ref ||
  $print_not_web || $print_bad)) {
  $print_bad=1;
}
 
# create CheckSite object and tell it to scan the site
my $site = new CheckSite($email, $delay, $max, $verbose, $proxy);
$site->scan($root_url);
 
# done with checking URLs.  Report results
 
 
# print out references to local machine
if ($print_local) {
  my %local = $site->local;
 
  print "\nList of referenced local URLs:\n";
  foreach $url (keys %local) {
    print "local: $url\n";
  }
}
 
 
# print out references to remote machines
if ($print_remote) {
  my %remote = $site->remote;
 
  print "\nList of referenced remote URLs:\n";
  foreach $url (keys %remote) {
    print "remote: $url\n";
  }
}
 
 
# print non-HTTP references
if ($print_not_web) {
  my %notweb = $site->not_web;
 
  print "\nReferenced non-HTTP links:\n";
  foreach $url (keys %notweb) {
    print "notweb: $url\n";
  }
}
 
# print reference list (what URL points to what)
if ($print_ref) {
  my $refer_by;
  my %ref = $site->ref;
 
  print "\nReference information:\n";
  while (($url,$refer_by) = each %ref) {
    print "\nref: $url is referenced by:\n";
    $refer_by =~ s/\n/\n  /g;  # insert two spaces after each \n
    print "  $refer_by";
  }
}
 
# print out bad URLs, the server response line, and the Referer
if ($print_bad) {
  my $reason;
  my $refer_by;
  my %bad = $site->bad;
  my %ref = $site->ref;
 
  print "\nThe following links are bad:\n";
  while (($url,$reason) = each %bad) {
    print "\nbad: $url  Reason: $reason";
    print "Referenced by:\n";
     $refer_by = $ref{$url};
     $refer_by =~ s/\n/\n  /g;  # insert two spaces after each \n
     print "  $refer_by";
  } # while there's a bad link
} # if bad links are to be reported
 
 
sub print_help( ) {
  print <<"USAGETEXT";
Usage:  $0 URL\n
Options:
  -l         Display local URLs
  -r         Display remote URLs
  -R         Display which HTML pages refers to what
  -n         Display non-HTML links
  -b         Display bad URLs (default)
  -a         Display all of the above
  -v         Print out URLs when they are examined
  -e email   Mandatory: Specify email address to include
	     in HTTP request. 
  -m #       Examine at most # URLs\n
  -p url     Use this proxy server
  -d #       Delay # minutes between requests.  (default=1)
	     Warning: setting # to 0 is not very nice.
  -h         This help text
 
Example: $0 -e me\@host.com -p http://proxy/ http://site_to_check/
USAGETEXT
  }
 
 
 
package CheckSite;
 
use HTTP::Status;
use HTTP::Request;
use HTTP::Response;
use LWP::RobotUA;
use URI::URL;
 
sub new {
 
  my ($class, $email, $delay, $max, $verbose, $proxy) = @_;
  my $self = {};
  bless $self, $class;
 
 
  # Create a User Agent object, give it a name, set delay between requests
  $self->{'ua'} = new LWP::RobotUA 'ORA_checksite/1.0', $email;
  if (defined $delay) {$self->{'ua'}->delay($delay);}
 
  # If proxy server specified, define it in the User Agent object
  if (defined $proxy) {
    $self->{'ua'}->proxy('http', $proxy);
  }
 
  $self->{'max'} = $max;
  $self->{'verbose'} = $verbose;
 
  $self;
}
 
sub scan {
 
  my ($self, $root_url)   = @_;
  my $verbose_link;
  my $num_visited = 0;
  my @urls;
 
  # clear out variables from any previous call to scan( )
  undef %{ $self->{'bad'} };
  undef %{ $self->{'not_web'} };
  undef %{ $self->{'local'} };
  undef %{ $self->{'remote'} };
  undef %{ $self->{'type'} };
  undef %{ $self->{'ref'} };
  undef %{ $self->{'touched'} };
 
  my $url_strict_state = URI::URL::strict( );   # to restore state later
  URI::URL::strict(1);
 
 
  my $parsed_root_url = eval { new URI::URL $root_url; };
  push (@urls , $root_url);
  $self->{'ref'}{$root_url} = "Root URL\n";
 
 
  while (@urls) {            # while URL queue not empty
    my $url=shift @urls;      # pop URL from queue & parse it
 
    # increment number of URLs visited and check if maximum is reached
    $num_visited++;
    last if (  ($self->{'max'}) && ($num_visited > $self->{'max'}) );
 
    # handle verbose information
    print STDERR "Looking at $url\n" if ($self->{'verbose'});
 
    my $parsed_url = eval { new URI::URL $url; };
 
    # if malformed URL (error in eval) , skip it
    if ($@) {
      $self->add_bad($url, "parse error: $@");
      next;
    }
 
    # if not HTTP, skip it
    if ($parsed_url->scheme !~ /http/i) {
      $self->{'not_web'}{$url}=1;
      next;
    }
 
    # skip urls that are not on same server as root url
    if (same_server($parsed_url, $parsed_root_url)) { 
      $self->{'local'}{$url}=1;
    } else {                                     # remote site
      $self->{'remote'}{$url}=1;
      next;              # only interested in local references
    }
 
    # Ask the User Agent object to get headers for the url
    # Results go into the response object (HTTP::Response).
 
 
    my $request  = new HTTP::Request('HEAD', $url);
    my $response = $self->{'ua'}->request($request);
 
    # if response wasn't RC_OK (200), skip it
    if ($response->code != RC_OK) {
      my $desc = status_message($response->code);
      $self->add_bad($url, "${desc}\n");
      next;
    }
 
    # keep track of every url's content-type
    $self->{'type'}{$url} = $response->header('Content-Type');
 
    # if not HTML, don't bother to search it for URLs
    next if ($response->header('Content-Type') !~ m@text/html@ );
 
    
    # it is text/html, get the entity-body this time
    $request->method('GET');
    $response = $self->{'ua'}->request($request);
 
    # if not OK or text/html... weird, it was a second ago.  skip it.
    next if ($response->code != RC_OK);
    next if ($response->header('Content-Type') !~ m@text/html@ );
 
    my $data     = $response->content;
    my @rel_urls = grab_urls($data);
 
    foreach $verbose_link (@rel_urls) {
 
      my $full_child =  eval {
        (new URI::URL $verbose_link, $response->base)->
        abs($response->base,1);
      };
      
      # if LWP doesn't recognize the child url, treat it as malformed
      if ($@) {
 
	# update list of bad urls, remember where it happened
	$self->add_bad($verbose_link, "unrecognized format: $@");
        $self->add_ref($verbose_link, $url);
 
        next;
      }
      else {
 
        # remove fragment in http urls
        if ( ($full_child->scheme( ) =~ /http/i) ) {
          $full_child->frag(''));
        }
 
        # handle reference list and push unvisited links onto queue
        $self->add_ref($full_child, $url);
        if (! defined $self->{'touched'}{$full_child}) {
          push (@urls, $full_child);
	}
 
        # remember which url we just pushed, to avoid repushing
        $self->{'touched'}{$full_child} = 1;
 
      }   # process valid links on page
    }     # foreach url in this page
  }       # while url(s) in queue
 
  URI::URL::strict($url_strict_state);  # restore state before exiting
 
} # scan
 
sub same_server {
  my ($host1, $host2) = @_;
 
  my $host2_name = $host2->host;
 
  if ($host1->host !~ /^$host2_name$/i) {return 0;}
  if ($host1->port != $host2->port) {return 0;}
 
  1;
}
 
# grab_urls($html_content) returns an array of links that are referenced
# from within the html.  Covers <body background>, <img src>, and <a href>.
# This includes a little more functionality than the 
# HTML::Element::extract_links( ) method.
sub grab_urls {
 
  my ($data) = @_;
  my @urls;
  my $key;
  my $link;
 
 
  my %tags = (
    'body' => 'background', 
    'img'  => 'src',
    'a'    => 'href'
  );
 
  # while there are HTML tags
  skip_others: while ($data =~ s/<([^>]*)>//)  {
 
    my $in_brackets=$1;
 
    foreach $key (keys %tags) {
 
      if ($in_brackets =~ /^\s*$key\s+/i) {     # if tag matches, try parms
        if ($in_brackets =~ /\s+$tags{$key}\s*=\s*["']([^"']*)["']/i) {
          $link=$1;
          $link =~ s/[\n\r]//g;  # kill newlines,returns anywhere in url
          push @urls, $link;
	  next skip_others;
        } 
	# handle case when url isn't in quotes (ie: <a href=thing>)
        elsif ($in_brackets =~ /\s+$tags{$key}\s*=\s*([^\s]+)/i) {
          $link=$1;
          $link =~ s/[\n\r]//g;  # kill newlines,returns anywhere in url
          push @urls, $link;
	  next skip_others;
        }    
      }       # if tag matches
    }         # foreach <a|img|body>
  }           # while there are brackets
  @urls;
}
 
# public interface to class's internal variables
 
# return associative array of bad urls and their error messages
sub bad {
  my $self = shift;
  %{ $self->{'bad'} };
}
 
# return associative array of encountered urls that are not http based
sub not_web {
  my $self = shift;
  %{ $self->{'not_web'} };
}
 
# return associative array of encountered urls that are local to the
# web server that was queried in the latest call to scan( )
 
sub local {
  my $self = shift;
  %{ $self->{'local'} };
}
 
# return associative array of encountered urls that are not local to the
# web server that was queried in the latest call to scan( )
 
sub remote {
  my $self = shift;
  %{ $self->{'remote'} };
}
 
# return associative array of encountered urls and their content-type
sub type {
  my $self = shift;
  %{ $self->{'type'} };
}
 
# return associative array of encountered urls and their parent urls,
# where parent urls are separated by newlines in one big string
 
sub ref {
  my $self = shift;
  %{ $self->{'ref'} };
}
 
# return associative array of encountered urls.  If we didn't push it
# into the queue of urls to visit, it isn't here.
 
sub touched {
  my $self = shift;
  %{ $self->{'touched'} };
}
 
# add_bad($child, $parent)
#   This keeps an associative array of urls, where the associated value 
#   of each url is an error message that was encountered when
#   parsing or accessing the url.  If error messages already exist for
#   the url, any additional error messages are concatenated to existing
#   messages.
 
sub add_bad {
  my ($self, $url, $msg) = @_;
 
  if (! defined $self->{'bad'}{$url} ) {
    $self->{'bad'}{$url}  = $msg;
  }
  else {
    $self->{'bad'}{$url} .= $msg;
  }
}
 
# add_ref($child, $parent)
#   This keeps an associative array of urls, where the associated value
#   of each url is a string of urls that refer to it.  So if 
#   url 'a' and 'b' refer to url 'c', then $self->{'ref'}{'c'}
#   would have a value of 'a\nb\n'.  The newline separates parent urls.
 
sub add_ref {
 
  my ($self, $child, $parent) = @_;
 
  if (! defined  $self->{'ref'}{$child} ) {
    $self->{'ref'}{$child} = "$parent\n";
  } 
  elsif ($self->{'ref'}{$child} !~ /$parent\n/) {
    $self->{'ref'}{$child} .= "$parent\n";
  }fo
 
}

In the following chapter, we'll do a few more examples, this time graphical examples using the Tk extension to Perl.

Back to: Chapter Index

Back to: Web Client Programming with Perl