O'Reilly Hacks
Amazon Hacks
By Paul Bausch
August 2003

HACK #40
Scrape Customer Advice
Screen scraping can give you access to community features not yet implemented through the API, such as customer buying advice.

The Code

This Perl script, get_advice.pl, splits the advice page into two variables based on the headings "in addition to" and "instead of." It then loops through those sections, using regular expressions to match the products' information. The script then formats and prints the information.
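
The splitting step described above can be tried offline before pointing the script at a live page. The sketch below is not part of get_advice.pl: the HTML in $content is a hypothetical stand-in written to fit the script's patterns, and the real advice page markup may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a fetched advice page (an assumption, not real Amazon markup)
my $content = <<'HTML';
<b>Customers recommend in addition to this item:</b>
<table><tr><td>row A</td></tr>
<tr><td colspan=3><br></td></tr></table>
<b>Customers offer recommendations instead of this item:</b>
<table><tr><td>row B</td></tr></table>
HTML

# Same patterns as get_advice.pl: capture the text between each heading
# and the markup that closes its section. /s lets . match newlines, /i
# ignores case, and .*? stops at the first closing marker it finds.
my($inAddition) = $content =~ m!in addition to(.*?)<tr><td colspan=3><br></td></tr>!mis;
my($instead)    = $content =~ m!recommendations instead of(.*?)</table>!mis;

print "-- In Addition To section --\n$inAddition\n";
print "-- Instead Of section --\n$instead\n";
```

If either variable comes back undefined against a real page, the headings or the closing markup have probably changed and the patterns need adjusting.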

#!/usr/bin/perl
# get_advice.pl
# A script to scrape Amazon to retrieve customer buying advice
# Usage: perl get_advice.pl <asin>

use strict;
use LWP::Simple;

# Take the ASIN from the command line
my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

# Assemble the URL
my $url = "http://amazon.com/o/tg/detail/-/" . $asin .
          "/?vi=advice";

# Set up unescape-HTML rules
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL
my $content = get($url);
die "Could not retrieve $url" unless $content;

# Split the page into the "in addition to" and "instead of" sections
my($inAddition) = $content =~ m!in addition to(.*?)<tr><td colspan=3><br></td></tr>!mis;
my($instead) = $content =~ m!recommendations instead of(.*?)</table>!mis;

#Loop through the HTML looking for "in addition" advice
print "-- In Addition To --\n\n";
while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
    my($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
    $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
    #Print the results
    print $place . " " . 
          $title . " (" . $thisAsin . ")\n(" . 
          "Recommendations: " . $number . ")" . 
          "\n\n";
}

#Loop through the HTML looking for "instead of" advice
print "-- Instead Of --\n\n";
while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
    my($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
    $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
    #Print the results
    print $place . " " . 
          $title . " (" . $thisAsin . ")\n(" . 
          "Recommendations: " . $number . ")" . 
          "\n\n";
}
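
To see what the row-matching loop and the unescape step do without making a network request, the pattern can be run against a hard-coded fragment. This is a minimal sketch under stated assumptions: the rows in $section, including the ASINs and titles, are invented to match the shape the script's regular expression expects, and quotemeta is added here as a precaution that the original script does not use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same unescape rules as get_advice.pl; quotemeta is an added precaution
my %unescape    = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', map { quotemeta($_) } keys %unescape;

# Hypothetical advice rows (made-up ASINs and titles, not real Amazon markup)
my $section = <<'HTML';
<tr>
<td width=10>1.</td>
<td width=90%><a href="/exec/obidos/ASIN/0000000001/ref=x">Perl &amp; LWP</a></td>
<td width=10% align=center>12</td>
</tr>
<tr>
<td width=10>2.</td>
<td width=90%><a href="/exec/obidos/ASIN/0000000002/ref=x">&quot;Hacks&quot;</a></td>
<td width=10% align=center>7</td>
</tr>
HTML

# Same row pattern as get_advice.pl: rank, ASIN, title, recommendation count
my @rows;
while ($section =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
    # Copy the captures first, because the substitution below resets $1
    my($place, $thisAsin, $title, $number) = ($1, $2, $3, $4);
    $title =~ s/($unescape_re)/$unescape{$1}/g;   # unescape HTML entities
    push @rows, "$place $title ($thisAsin) recommendations: $number";
}

print "$_\n" for @rows;
```

Collecting the rows in an array rather than printing inside the loop is a choice made for this sketch so the parsed results can be inspected or tested separately from the output.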



© 2007 O'Reilly Media, Inc.