O'Reilly Hacks
Spidering Hacks
By Kevin Hemenway, Tara Calishain
October 2003

HACK #54
Scraping Amazon.com Customer Advice
Screen scraping can give you access to Amazon.com community features not yet implemented through Amazon.com's public Web Services API. In this hack, we'll implement a script to scrape customer buying advice.

The Code

This Perl script splits the advice page into two variables, based on the headings "in addition to" and "instead of." It then loops through those sections, using regular expressions to match the products' information. The script then formats and prints the information.
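
As a rough illustration of the splitting step, the sketch below applies the same two section patterns to a small made-up fragment. The markup in $sample is an assumption invented for this example and is not Amazon's actual page source, so treat it only as a way to try the patterns offline.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical stand-in for the advice page; NOT real Amazon markup.
my $sample = <<'HTML';
<table><tr><td>recommendations in addition to this item
<b>Book A</b></td></tr>
<tr><td>recommendations instead of this item
<b>Book B</b></td></tr></table>
HTML

# The same two patterns the script uses to carve out each section.
# /s lets "." cross newlines, and /i ignores case in the headings.
my ($inAddition) = $sample =~ m!in addition to(.*?)(instead of)?</td></tr>!mis;
my ($instead)    = $sample =~ m!recommendations instead of(.*?)</table>!mis;

print "in addition: $inAddition\n";
print "instead: $instead\n";
```

Each variable ends up holding only the HTML that follows its heading, which is what lets the later loops match product rows within one section at a time.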

Save the following script to a file called get_advice.pl:

#!/usr/bin/perl -w
# get_advice.pl
#
# A script to scrape Amazon to retrieve customer buying advice
# Usage: perl get_advice.pl <asin>
use strict; use LWP::Simple;

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";

# Set up unescape-HTML rules. Quicker than loading HTML::Entities.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

# Get our matching data.
my ($inAddition) = $content
    =~ m!in addition to(.*?)(instead of)?</td></tr>!mis;
my ($instead)    = $content
    =~ m!recommendations instead of(.*?)</table>!mis;

# Look for "in addition to" advice.
if ($inAddition) { print "-- In Addition To --\n\n";
   while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
       my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
       $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
       print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
   }
}

# Look for "instead of" advice.
if ($instead) { print "-- Instead Of --\n\n";
    while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
        my ($place,$thisAsin,$title,$number)
          = ($1||'',$2||'',$3||'',$4||'');
        $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
        print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
    }
}
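
Both loops above use the same row pattern and the same entity cleanup, so one way to try them without a network request is to run them against a sample row. The sketch below does that under some loud assumptions: parse_rows is a hypothetical helper that is not part of the original script, and the row in $sample (including the placeholder ASIN B000000000 and the title) is invented for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# Same entity table as get_advice.pl.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Row pattern from get_advice.pl, compiled once so it can be reused.
my $row_re = qr!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!is;

# parse_rows is a hypothetical helper, not part of the original script.
# It returns one hash reference per product row found in $section.
sub parse_rows {
    my ($section) = @_;
    my @rows;
    while ($section =~ /$row_re/g) {
        my ($place, $asin, $title, $number) = ($1||'', $2||'', $3||'', $4||'');
        $title =~ s/($unescape_re)/$unescape{$1}/g;   # unescape HTML
        push @rows, { place => $place, asin  => $asin,
                      title => $title, number => $number };
    }
    return @rows;
}

# Invented sample row; the ASIN and title are placeholders.
my $sample = <<'HTML';
<tr><td width=10>1.</td>
<td width=90%><a href="/exec/obidos/ASIN/B000000000/ref=x">Tom &amp; Jerry&nbsp;Guide</a></td>
<td width=10% align=center>12</td></tr>
HTML

for my $row (parse_rows($sample)) {
    print "$row->{place} $row->{title} ($row->{asin})\n";
    print "(Recommendations: $row->{number})\n\n";
}
```

Keeping the pattern in a single qr// value means a change to the page layout only has to be accommodated in one place rather than in both loops.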



© 2007 O'Reilly Media, Inc.