The Code
This script takes a Purchase Circle ID and returns the books listed.
Create a file called get_circle.pl and add the following code:
#!/usr/bin/perl
# get_circle.pl
# A script to scrape Amazon to retrieve purchase circle products
# Usage: perl get_circle.pl <circleID>
use strict;
use LWP::Simple;
# Take the circle ID from the command line
my $circleID = shift @ARGV or die "Usage: perl get_circle.pl <circleID>\n";
# Assemble the URL
my $url = "http://amazon.com/o/tg/cm/browse-communities/-/" .
          $circleID . "/t/";
# Request the URL
my $content = get($url);
die "Could not retrieve $url" unless $content;
my $circle = $content;
# Print the page title
while ($circle =~ m!<title>(.*?)</title>!mgis) {
    print $1 . "\n\n";
}
# Pull the ASIN, title, and author out of each table cell
while ($circle =~ m!<td.*?<b><a.*?-/(.*?)[?/].*?>(.*?)</a></b>.*?by\s*(.*?)<br>.*?</td>!mgis) {
    my($asin,$title,$author) = ($1||'',$2||'',$3||'');
    # Print the results
    print $title . "\n" .
          "by " . $author . "\n" .
          "ASIN: " . $asin .
          "\n\n";
}
One thing to note about this code is that it passes the
/t/ URL argument to return a text-only version of
the purchase circle page. Text-only pages have less HTML, which means
that fewer bytes are flying around and it's generally easier to
scrape for information.
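To see what the scraping pattern captures without making a network request, the following minimal sketch runs an equivalent one-line pattern against a small HTML fragment. The fragment is made up for illustration and only loosely modeled on what the pattern expects; the ASINs, titles, and authors in it are placeholders, and parse_circle is a hypothetical helper name that does not appear in the script above. Real purchase circle pages may be marked up differently, in which case the pattern would need adjusting.

```perl
#!/usr/bin/perl
# Offline sketch: apply the purchase-circle pattern to a made-up fragment.
use strict;
use warnings;

# Hypothetical sample markup (an assumption, not captured from Amazon)
my $circle = <<'HTML';
<title>Example Purchase Circle</title>
<table>
<tr><td>1. <b><a href="/exec/obidos/tg/detail/-/0000000001/ref=example">First Example Title</a></b><br>
by Jane Author<br>
</td></tr>
<tr><td>2. <b><a href="/exec/obidos/tg/detail/-/0000000002?v=glance">Second Example Title</a></b><br>
by John Writer<br>
</td></tr>
</table>
HTML

# parse_circle is a hypothetical helper: it returns one hash reference
# per matched table cell instead of printing as it goes.
sub parse_circle {
    my ($html) = @_;
    my @books;
    while ($html =~ m!<td.*?<b><a.*?-/(.*?)[?/].*?>(.*?)</a></b>.*?by\s*(.*?)<br>.*?</td>!mgis) {
        # $1 = text between "-/" and the next "/" or "?" in the link (the ASIN)
        # $2 = link text (the title), $3 = text after "by" up to <br> (the author)
        push @books, { asin => $1, title => $2, author => $3 };
    }
    return @books;
}

# Print each record in the same layout the script uses
for my $book (parse_circle($circle)) {
    print $book->{title} . "\n" .
          "by " . $book->{author} . "\n" .
          "ASIN: " . $book->{asin} . "\n\n";
}
```

Returning a list of records rather than printing inside the loop makes it easier to check the pattern against saved pages before pointing the script at the live site.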