By Sean M. Burke
% perl -MLWP -le "print(LWP->VERSION)"
(The l in -le is a lowercase L, not a digit one.)
Can't locate LWP in @INC (@INC contains: ...lots of paths...).
BEGIN failed--compilation aborted.
% perl -MCPAN -eshell
We have to reconfigure CPAN.pm due to following uninitialized parameters:
perldoc CPAN or by starting up the CPAN shell (with perl -MCPAN -eshell at a system shell prompt) and entering
sleep 60; to sleep for one minute), so that the load that you're placing on the network and on the web server is spread unobtrusively over a longer period of time.
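The idea of pausing between requests can be sketched as a small loop. This is only an illustration: fetch_politely( ) is a hypothetical helper, not part of LWP, and the stand-in fetcher just measures each URL's length so the sketch runs without a network connection.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch on each URL, pausing $delay seconds
# between consecutive requests (but not after the last one).
sub fetch_politely {
    my ($fetch, $delay, @urls) = @_;
    my @results;
    for my $i (0 .. $#urls) {
        push @results, $fetch->($urls[$i]);
        sleep $delay if $delay > 0 and $i < $#urls;
    }
    return @results;
}

# Stand-in fetcher so the sketch runs offline; a real program might
# pass a wrapper around LWP::Simple's get( ) and a delay such as 60.
my @lengths = fetch_politely(
    sub { length $_[0] },
    0,
    'http://www.oreilly.com/',
    'http://www.oreilly.com/catalog',
);
print "@lengths\n";
```

With a real fetcher and a delay of 60, the same loop would issue one request per minute.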
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $catalog = get("http://www.oreilly.com/catalog");
my $count = 0;
$count++ while $catalog =~ m{Perl}gi;
print "$count\n";
The get( ) function returns the document at a given URL or undef if an error occurred. A regular expression match in a loop counts the number of occurrences.
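The counting idiom is easy to try without fetching anything. In this sketch, count_matches( ) and the sample string are made up for illustration; the loop is the same m{...}gi match in scalar context that the example uses.

```perl
use strict;
use warnings;

# Count case-insensitive occurrences of a word in a string.
# Each successful match in scalar context with /g resumes where the
# previous one stopped, so the loop body runs once per occurrence.
sub count_matches {
    my ($text, $word) = @_;
    my $count = 0;
    $count++ while $text =~ m{\Q$word\E}gi;
    return $count;
}

my $sample = "Learning Perl, Programming perl, and PERL Cookbook";
print count_matches($sample, "Perl"), "\n";   # prints 3
```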
#!/usr/bin/perl -w
use strict;
use LWP;
my $browser = LWP::UserAgent->new( );
my $response = $browser->get("http://www.oreilly.com/");
print $response->header("Server"), "\n";
$browser and $response are references to objects. The LWP::UserAgent object $browser makes requests of a server and creates HTTP::Response objects such as $response to represent the server's reply. In Example 1-2, we call the header( ) method on the response to check one of the HTTP header values.
#!/usr/bin/perl -w
# pl8.pl - query California license plate database
use strict;
use LWP::UserAgent;
my $plate = $ARGV[0] || die "Plate to search for?\n";
$plate = uc $plate;
$plate =~ tr/O/0/; # we use zero for letter-oh
die "$plate is invalid.\n"
  unless $plate =~ m/^[A-Z0-9]{2,7}$/
    and $plate !~ m/^\d+$/; # no all-digit plates
my $browser = LWP::UserAgent->new;
my $response = $browser->post(
  'http://plates.ca.gov/search/search.php3',
  [
    'plate' => $plate,
    'search' => 'Check Plate Availability'
  ],
);
die "Error: ", $response->status_line
  unless $response->is_success;
if($response->content =~ m/is unavailable/) {
  print "$plate is already taken.\n";
} elsif($response->content =~ m/and available/) {
  print "$plate is AVAILABLE!\n";
} else {
  print "$plate... Can't make sense of response?!\n";
}
exit;
http://www.oreilly.com/news/bikeweek_day1.html
scheme://username@server:port/path?query
ftp://ftp.is.co.za/rfc/rfc1808.txt
ftp://ftp.is.co.za/rfc/rfc1808.txt and
fTp://ftp.Is.cO.ZA/rfc/rfc1808.txt are the
same, but ftp://ftp.is.co.za/rfc/rfc1808.txt and
ftp://ftp.is.co.za/rfc/RFC1808.txt are not,
unless that server happens to forgive case differences in requests.
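To compare two URLs under these rules, you can lowercase only the scheme and the server name and leave the path untouched. The following is a rough sketch using a plain regular expression; normalize_case( ) is a hypothetical helper, and it assumes the URL has no username part.

```perl
use strict;
use warnings;

# Lowercase the scheme and server name, which are case-insensitive,
# but leave the path alone, because paths may be case-sensitive.
# Assumption: the URL has no "username@" part before the server name.
sub normalize_case {
    my ($url) = @_;
    $url =~ s{^([A-Za-z][A-Za-z0-9+.\-]*)://([^/?#]*)}
             {lc($1) . "://" . lc($2)}e;
    return $url;
}

my $first  = normalize_case('ftp://ftp.is.co.za/rfc/rfc1808.txt');
my $second = normalize_case('fTp://ftp.Is.cO.ZA/rfc/rfc1808.txt');
my $third  = normalize_case('ftp://ftp.is.co.za/rfc/RFC1808.txt');

print "first and second match\n" if $first eq $second;
print "first and third differ\n" if $first ne $third;
```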
- _ . ! ~ * ' , : @ & + $ ( ) /
- _ . ! ~ * ' ( )
%20, because space is character 32 in ASCII, and
the number 32 expressed in hexadecimal is 20.
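The encoding rule can be tried out with a few lines of core Perl. This is a minimal sketch, not the implementation used by any particular module: percent_encode( ) is a made-up name, it treats letters, digits, and the characters - _ . ! ~ * ' ( ) as safe, and it assumes the input is a plain ASCII string.

```perl
use strict;
use warnings;

# Replace every character outside the safe set with a percent sign
# followed by its character code as two hexadecimal digits.
sub percent_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

my $encoded = percent_encode("I like pie");
print "$encoded\n";                                    # I%20like%20pie
printf "space is %d decimal, %X hex\n", ord(" "), ord(" ");
```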
GET /daily/2001/01/05/1.html HTTP/1.1
Host: www.suck.com
User-Agent: Super Duper Browser 14.6
blank line
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 24204
blank line
and then 24,204 bytes of HTML code
HTTP/1.1 404 Not Found
Content-type: text/html
Content-length: 135

<html><head><title>Not Found</title></head><body>
Sorry, the object you requested was not found.
</body></html>
and then the server closes the connection
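To make the layout of a response concrete, here is a small sketch that splits one apart by hand: a status line, then header lines, then a blank line, then the body. LWP does all of this for you; parse_response( ) and the sample text are made up for illustration, and the sketch ignores details such as repeated or continued header lines.

```perl
use strict;
use warnings;

# Split a raw HTTP response into status code, reason, headers, and body.
sub parse_response {
    my ($raw) = @_;
    # The first blank line separates the headers from the body.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    my ($status_line, @header_lines) = split /\r?\n/, $head;
    my (undef, $code, $reason) = split / /, $status_line, 3;
    my %headers;
    for my $line (@header_lines) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $headers{lc $name} = $value;   # header names are case-insensitive
    }
    return ($code, $reason, \%headers, $body);
}

my $html = '<html><body>Sorry, not found.</body></html>';
my $raw  = join "\r\n",
    "HTTP/1.1 404 Not Found",
    "Content-type: text/html",
    "Content-length: " . length($html),
    "",
    $html;

my ($code, $reason, $headers, $body) = parse_response($raw);
print "$code $reason ($headers->{'content-type'})\n";
```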
HTTP/1.1.
User-Agent: SuperDuperBrowser/14.6). In versions of HTTP previous to 1.1, header lines were optional. In HTTP 1.1, the Host: header must be present, to name the server to which the browser is talking. This is the "server" part of the URL being requested (e.g.,
print the document.
The get( ) function takes a URL and returns the body of the document:
$document = get("http://www.suck.com/daily/2001/01/05/1.html");
If an error occurs, get( ) returns undef. Incidentally, if LWP requests that URL and the server replies that it has moved to some other URL, LWP requests that other URL and returns that.
With LWP::Simple's get( ) function, there's no way to set headers to be sent with the GET request or get more information about the response, such as the status code. These are important things, because some web servers have copies of documents in different languages and use the HTTP language header to determine which document to return. Likewise, the HTTP response code can let us distinguish between permanent failures (e.g., "404 Not Found") and temporary failures ("503 Service [Temporarily] Unavailable").
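Once you have the numeric status code, telling the two kinds of failure apart is a simple lookup. The sketch below is one possible policy, not a rule from the HTTP specification: failure_kind( ) is a hypothetical helper, and the list of codes treated as temporary is an assumption you would tune for your own program.

```perl
use strict;
use warnings;

# Assumption: these codes are worth retrying later; any other 4xx or
# 5xx code is treated as a permanent failure.
my %temporary = map { $_ => 1 } (408, 500, 502, 503, 504);

sub failure_kind {
    my ($code) = @_;
    return 'success'   if $code >= 200 and $code < 300;
    return 'temporary' if $temporary{$code};
    return 'permanent' if $code >= 400 and $code < 600;
    return 'other';    # informational or redirection codes
}

for my $code (200, 404, 503) {
    print "$code: ", failure_kind($code), "\n";
}
```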
getstore( ) function, which writes the document to a file and returns the status code from the response:
$status = getstore("http://www.suck.com/daily/2001/01/05/1.html",
                   "/tmp/web.html");
do_GET( ) subroutine shown in Example 2-5.
use LWP;
my $browser;
sub do_GET {
  # Parameters: the URL,
  # and then, optionally, any header lines: (key,value, key,value)
  $browser = LWP::UserAgent->new( ) unless $browser;
  my $resp = $browser->get(@_);
  return ($resp->content, $resp->status_line, $resp->is_success, $resp)
    if wantarray;
  return unless $resp->is_success;
  return $resp->content;
}
do_GET( ) is given in Chapter 3. Until then, we'll be using it without fully understanding how it works.
do_GET( ) function in either scalar or list context:
doc = do_GET(URL [header, value, ...]);
(doc, status, successful, response) = do_GET(URL [header, value, ...]);
undef if there is an error. In list context, it returns the document (if any), the status line from the HTTP response, a Boolean value indicating whether the status code indicates a successful response, and an object we can interrogate to find out more about the response.
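The trick that lets one subroutine behave differently in scalar and list context is Perl's built-in wantarray. Here is a stripped-down sketch of the same pattern with no network access; lookup( ) and its canned values are made up for illustration.

```perl
use strict;
use warnings;

# Return everything in list context, but only the document in scalar
# context, the same shape of interface that do_GET( ) offers.
sub lookup {
    my @result = ("body text", "200 OK", 1);
    return @result if wantarray;   # caller asked for a list
    return $result[0];             # caller asked for a single value
}

my $doc = lookup();                      # scalar context
my ($doc2, $status, $ok) = lookup();     # list context
my ($only, undef, $flag) = lookup();     # list context, skipping a value

print "$doc\n$status\n$flag\n";
```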
undef discards that value. For example, this is how you fetch a document into a string and learn whether it is successful:
($doc, undef, $successful, undef) = do_GET('http://www.suck.com/');
do_GET( ) let you add headers to the request. For example, to attempt to fetch the German language version of the European Union home page:
$body = do_GET("http://europa.eu.int/",
  "Accept-language" => "de",
);
do_GET( ) function that we'll use in this chapter provides the same basic convenience as LWP::Simple's get( ) but without the limitations.
Dear Mr. Hing:
I have read with intense interest your letter detailing your struggle with the question of whether your favorite savory spice should be spelled in English as "asafoetida" or whether you should heed your secretary's admonishment that all the kids today are spelling it "asafetida."
I could note various factors potentially involved here; notably, the fact that in many cases, British/Commonwealth spelling retains many "ae"/"oe" digraphs whereas U.S./Canadian spelling strongly prefers an "e" ("foetus"/"fetus," etc.). But I will instead be (merely) democratic about this and note that if you use AltaVista (http://altavista.com, a well-known search engine) to run a search on "asafetida," it will say that across all the pages that AltaVista has indexed, there are "about 4,170" matched; whereas for "asafoetida" there are many more, "about 8,720."
So you, with the "oe," are apparently in the majority.
% alta_count asafetida asafoetida
asafetida: 4,170 matches
asafoetida: 8,720 matches
http://altavista.com, putting a word or
phrase in the search box, and hitting the Submit button yields a
result page with a URL that looks like this:
http://www.altavista.com/sites/search/web?q=%22asafetida%22&kl=XX
http://babelfish.altavista.com) is a service
that lets you translate text from one human language into another. If
you're accessing Babelfish from a browser, you see
an HTML form where you paste in the text you want translated, specify
the language you want it translated from and to, and hit Translate.
After a few seconds, a new page appears, with your translation.
urltext = I like pie
lp = en_fr
enc = utf8
POST /translate.dyn HTTP/1.1
Host: babelfish.altavista.com
User-Agent: SuperDuperBrowser/14.6
Content-Type: application/x-www-form-urlencoded
Content-Length: 40

urltext=I%20like%20pie&lp=en_fr&enc=utf8
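The body of that request, and its Content-Length, can be reproduced by hand. This is a sketch for illustration only: form_body( ) is a made-up helper, and it encodes a space as %20 to match the request shown above, whereas many clients send a + for a space in form data.

```perl
use strict;
use warnings;

# Build an application/x-www-form-urlencoded body from key/value pairs,
# keeping the pairs in the order given.
sub form_body {
    my @pairs = @_;
    my @parts;
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        for ($key, $value) {
            # Percent-encode anything outside the safe set.
            s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord($1))/ge;
        }
        push @parts, "$key=$value";
    }
    return join '&', @parts;
}

my $body = form_body(
    urltext => "I like pie",
    lp      => "en_fr",
    enc     => "utf8",
);
print "Content-Length: ", length($body), "\n\n$body\n";
```

The length printed here is the value the client must send in the Content-Length header, so that the server knows where the body ends.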
do_GET( ) function to automate a GET query, Example 2-7 uses a do_POST( ) function to automate POST queries.
use LWP;
my $browser;
sub do_POST {
  # Parameters:
  # the URL,
  # an arrayref or hashref for the key/value pairs,
  # and then, optionally, any header lines: (key,value, key,value)
  $browser = LWP::UserAgent->new( ) unless $browser;
  my $resp = $browser->post(@_);
  return ($resp->content, $resp->status_line, $resp->is_success, $resp)
    if wantarray;
  return unless $resp->is_success;
  return $resp->content;
}
do_POST( ) like this:
doc = do_POST(URL, [form_ref, [header, value, ...]]);
(doc, status, success, resp) = do_POST(URL, [form_ref, [header, value, ...]]);
do_GET( ). The form_ref parameter is a reference to a hash or an array containing the form parameters. Any arguments after it are header names and values that you want sent in the request.
my ($content, $message, $is_success) = do_POST(
  'http://babelfish.altavista.com/translate.dyn',
  [ 'urltext' => "I like pie", 'lp' => "en_fr", 'enc' => 'utf8' ],
);
$is_success will tell us this), $content will be an HTML page that contains the translation text. At time of this writing, the translation is inside the only textarea element on the page, so it can be extracted with just this regexp:
$content =~ m{<textarea.*?>(.*?)</textarea>}is;
$1, if the match succeeded.
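You can try the regexp on a canned page before pointing it at a live server. The HTML below is a made-up stand-in, not real Babelfish output; the /s modifier lets . match the newlines inside the textarea, and /i makes the tag match case-insensitive.

```perl
use strict;
use warnings;

# A made-up page with a single textarea, standing in for a real response.
my $html = <<'END_HTML';
<html><body>
<form><textarea name="q" rows="3">
  sample translated text
</textarea></form>
</body></html>
END_HTML

my $found;
if ($html =~ m{<textarea.*?>(.*?)</textarea>}is) {
    $found = $1;
    $found =~ s/\s+/ /g;      # collapse runs of whitespace
    $found =~ s/^ | $//g;     # trim the ends
}
print defined $found ? "$found\n" : "no textarea found\n";
```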
sub translate {
  my ($text, $language_path) = @_;
  my ($content, $message, $is_success) = do_POST(
    'http://babelfish.altavista.com/translate.dyn',
    [ 'urltext' => $text, 'lp' => $language_path, 'enc' => 'utf8' ],
  );
  die "Error in translation $language_path: $message\n"
    unless $is_success;
  if ($content =~ m{<textarea.*?>(.*?)</textarea>}is) {
    my $translation;
    $translation = $1;
    # Trim whitespace:
    $translation =~ s/\s+/ /g;
    $translation =~ s/^ //s;
    $translation =~ s/ $//s;
    return $translation;
  } else {
    die "Can't find translation in response to $language_path";
  }
}
The translate( ) subroutine constructs the request and extracts the translation from the response, cleaning up any whitespace that may surround it. If the request couldn't be completed, the subroutine throws an exception by calling die( ).
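A caller that doesn't want one failed translation to end the whole program can trap the exception with eval. In this sketch, risky_translate( ) is a hypothetical stand-in that fails for every language path except en_fr, so that the example runs without a network connection.

```perl
use strict;
use warnings;

# Hypothetical stand-in for translate( ): dies for anything but en_fr.
sub risky_translate {
    my ($text, $language_path) = @_;
    die "Error in translation $language_path: 500 Internal Server Error\n"
        unless $language_path eq 'en_fr';
    return "[$language_path] $text";
}

# eval returns undef if the block dies, and puts the message in $@.
my $result = eval { risky_translate("I like pie", "en_xx") };
my $error  = $@;

if (defined $result) {
    print "Got: $result\n";
} else {
    print "Translation failed: $error";
}
```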
translate( ) subroutine could be used to
automate on-demand translation of important content from one language
to another. But machine translation is still a fairly new technology,
and the real value of it is to be found in translating from English
into another language and then back into English, just for fun.
(Incidentally, there's a CPAN module that takes care
of all these details for you, called Lingua::Translate, but here
we're interested in how to carry out the task,
rather than whether someone's already figured it out
and posted it to CPAN.)
use LWP::UserAgent;
use HTTP::Response;
use LWP; # same as previous two lines
env_proxy():
use LWP::UserAgent;
my $browser = LWP::UserAgent->new( );
$browser->env_proxy( ); # if we're behind a firewall
get( ),
head( ), or post( ) methods, to
get what's at a URL, or to perform HTTP HEAD or POST
requests on it. For example:
$url = 'http://www.guardian.co.uk/';
my $response = $browser->get($url);
die "Hmm, error \"", $response->status_line( ),
  "\" when getting $url" unless $response->is_success( );
my $content_type = $response->content_type( );
die "Hm, unexpected content type $content_type from $url"
  unless $content_type eq 'text/html';
my $content = $response->content( );
die "Odd, the content from $url is awfully short!"
  if length($content) < 3000;
if($content =~ m/Madonna|Arkansas/i) {
  print "<!-- The news today is IMPORTANT -->\n",
    $content;
} else {
  print "$url has no news of ANY CONCEIVABLE IMPORTANCE!\n";
}
do_GET( ) and do_POST( ) functions introduced in Chapter 2. Let's look at do_GET( ) first.
$browser variable that will hold the user agent. It's declared outside the scope of the do_GET( ) subroutine, so it's essentially a static variable, retaining its value between calls to the subroutine. For example, if you turn on support for HTTP cookies, this browser could persist between calls to do_GET( ), and cookies set by the server in one call would be sent back in a subsequent call.
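The "static variable" behavior is plain Perl scoping and can be seen without LWP at all. In this sketch, a hash reference stands in for the user agent object, and shared_agent( ) and $created are made-up names; the point is that the object is built on the first call only and reused afterward.

```perl
use strict;
use warnings;

my $agent;          # declared outside the sub, so it survives between calls
my $created = 0;    # counts how many times we had to build the object

sub shared_agent {
    unless ($agent) {
        $agent = { cookies => [] };   # stand-in for LWP::UserAgent->new( )
        $created++;
    }
    return $agent;
}

my $first  = shared_agent();
push @{ $first->{cookies} }, 'session=abc';   # state set during one call...
my $second = shared_agent();                  # ...is still there in the next

print "created $created time(s), cookies: @{ $second->{cookies} }\n";
```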
use LWP;
my $browser;
sub do_GET {
  $browser = LWP::UserAgent->new( ) unless $browser;
  $browser->env_proxy();
  my $response = $browser->get(@_);
  return($response->content, $response->status_line, $response->is_success, $response) if wantarray;
  # in scalar context, a failed request returns undef:
  return unless $response->is_success;
  return $response->content;
}
do_POST( ) subroutine is just like do_GET( ), only it uses the post( ) method instead of get( ).
new( ) constructor makes a user agent object:
$browser = LWP::UserAgent->new(%options);

| Key | Default |
|---|---|
| agent | "libwww-perl/#.###" |
| conn_cache | undef |
| cookie_jar | undef |
| from | undef |
| max_size | undef |
| parse_head | 1 |
| protocols_allowed | undef |
| protocols_forbidden | undef |
| requests_redirectable | ['GET', 'HEAD'] |
| timeout | 180 |
clone( )
method:
$copy = $browser->clone( );
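As a brief illustration of constructor options and clone( ), the sketch below sets two of the keys listed above and then changes the timeout on a copy. The agent string and timeout values are arbitrary choices for the example, and no request is sent.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Override two defaults at construction time.
my $browser = LWP::UserAgent->new(
    agent   => 'SuperDuperBrowser/14.6',
    timeout => 30,
);

# The copy starts with the same settings; changing it afterward
# leaves the original alone.
my $copy = $browser->clone;
$copy->timeout(10);

print $browser->agent, " times out after ", $browser->timeout, "s\n";
print $copy->agent,    " times out after ", $copy->timeout,    "s\n";
```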
->new( ).
HTTP::Response objects are a notable exception. You never need to call HTTP::Response->new( ) to make them; instead, you just get them back as the result of a request made with one of the request methods (get( ), post( ), and head( )).
code( ) method returns the HTTP
status code:
print "HTTP status: ", $response->code( ), "\n";
HTTP status: 404
is_success( ) returns a true value if the response
had a successful HTTP status code, or false if it
didn't (e.g., 404, 403, 500, etc.). Always check
your responses, like so:
die "Couldn't get the document" unless $response->is_success( );
# Given $response and $url ...
die "Error getting $url\n", $response->status_line
  unless $response->is_success( );
status_line( ) method returns
the entire HTTP status line:
$sl = $response->status_line( );
$response = $browser->get("http://www.cpan.org/nonesuch");
print $response->status_line( );
404 Not Found
code( )
method:
$code = $response->code( );
message( ) method:
$msg = $response->message( );
$response = $browser->get("http://www.cpan.org/nonesuch");
print $response->code(), " (that means ", $response->message( ), ")\n";
404 (that means Not Found)
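If you want to experiment with these methods without making a request, you can build a response object by hand. Normally you never need to do this, since the request methods hand you a response; the sketch below does it only so that the example runs offline, and describe( ) is a made-up helper.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand (code, message) and report on it.
sub describe {
    my ($code, $message) = @_;
    my $response = HTTP::Response->new($code, $message);
    return sprintf "%s / %s / %s / %s",
        $response->code,
        $response->message,
        $response->status_line,
        ($response->is_success ? "success" : "not success");
}

print describe(404, 'Not Found'), "\n";
print describe(200, 'OK'), "\n";
```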
is_error( ), is_success( ), is_redirect( ), and is_info( ). They return true if the status code corresponds to an error, a successful fetch, a redirection, or an informational response (e.g., "102 Processing"), respectively.
use URI;
my $url = URI->new('http://user:pass@example.int:4345/hello.php?user=12');
print "Scheme: ", $url->scheme( ), "\n";
print "Userinfo: ", $url->userinfo( ), "\n";
print "Hostname: ", $url->host( ), "\n";
print "Port: ", $url->port( ), "\n";
print "Path: ", $url->path( ), "\n";
print "Query: ", $url->query( ), "\n";
Scheme: http
Userinfo: user:pass
Hostname: example.int
Port: 4345
Path: /hello.php
Query: user=12
host( ) can also alter the parts of a URL, using the familiar convention that $object->method reads an attribute's value and $object->method(newvalue) alters an attribute:
use URI;
my $uri = URI->new("http://www.perl.com/I/like/pie.html");
$uri->host('testing.perl.com');
print $uri,"\n";
http://testing.perl.com/I/like/pie.html
new( ) constructor:
$url = URI->new(url [, scheme ]);
staff/alicia.html), scheme determines the scheme you plan for this URL to have (
http://phee.phye.phoe.fm/thingamajig/stuff.html
http)" and "on the same host (phee.phye.phoe.fm)," and the explicit information is "with the path /also.html." So this is equivalent to an absolute URL of:
http://phee.phye.phoe.fm/also.html
".." means "up one level", "." means "in this level", and anything else means "in this directory". So a relative URL of just zing.xml interpreted relative to http://phee.phye.phoe.fm/thingamajig/stuff.html yields this absolute URL:
http://phee.phye.phoe.fm/thingamajig/zing.xml
http://phee.phye.phoe.fm/hi_there.jpg
".." tells us to go up one level, giving us http://phee.phye.phoe.fm/. Append hi_there.jpg, giving us the URL you see above.
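The URI class can do this resolution for you with its new_abs( ) constructor, which takes a relative URL and the base URL to interpret it against. The sketch below resolves the three cases discussed above; the relative form ../hi_there.jpg is an assumption reconstructed from the description of the last case.

```perl
use strict;
use warnings;
use URI;

my $base = 'http://phee.phye.phoe.fm/thingamajig/stuff.html';

# Resolve each relative URL against the base and collect the results.
my %absolute;
for my $relative ('/also.html', 'zing.xml', '../hi_there.jpg') {
    $absolute{$relative} = URI->new_abs($relative, $base)->as_string;
    print "$relative => $absolute{$relative}\n";
}
```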