
Scraping Amazon.com Customer Advice
Screen scraping can give you access to
Amazon.com community features not yet implemented through
Amazon.com's public Web Services API. In this hack,
we'll implement a script to scrape customer buying
advice.
The Code
This Perl script splits the advice page into two variables, based on
the headings "in addition to" and "instead of." It then loops through
those sections, using regular expressions to match the products'
information. The script then formats and prints the information.
Save the following script to a file called get_advice.pl:

#!/usr/bin/perl -w
# get_advice.pl
#
# A script to scrape Amazon to retrieve customer buying advice
# Usage: perl get_advice.pl <asin>
use strict;
use LWP::Simple;
# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";
# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";
# Set up unescape-HTML rules. Quicker than loading HTML::Entities.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
# Get our matching data.
my ($inAddition) = $content
=~ m!in addition to(.*?)(instead of)?</td></tr>!mis;
my ($instead) = $content
=~ m!recommendations instead of(.*?)</table>!mis;
# Look for "in addition to" advice.
if ($inAddition) {
print "-- In Addition To --\n\n";
while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
$title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML
print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
}
}
# Look for "instead of" advice.
if ($instead) {
print "-- Instead Of --\n\n";
while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
$title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML
print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
}
}
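
Because the script depends entirely on the page's markup, it can help to try the regular expressions offline before pointing them at a live page. The sketch below runs the same section-splitting and row-matching patterns against a small hand-written sample. The sample HTML, the ASINs in it, and the parse_rows helper are hypothetical, invented only for illustration; they are not captured from Amazon.com, and the real page layout may differ.

```perl
#!/usr/bin/perl
# Offline sketch: exercise the hack's regular expressions on sample markup.
use strict;
use warnings;

# Same unescape table as the script, written out as HTML entities.
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', keys %unescape;

# Hypothetical markup shaped the way the regexes expect (an assumption,
# not a copy of a real advice page).
my $sample = <<'HTML';
<tr><td>
<b>Customers recommend in addition to this item:</b>
<table>
<tr>
<td width=10>1.</td>
<td width=90%><a href="/exec/obidos/ASIN/0000000001/ref=x">Perl &amp; LWP</a></td>
<td width=10% align=center>12</td>
</tr>
</table>
</td></tr>
<tr><td>
<b>Customers have made these recommendations instead of this item:</b>
<table>
<tr>
<td width=10>1.</td>
<td width=90%><a href="/exec/obidos/ASIN/0000000002/ref=x">&quot;Spidering&quot; Basics</a></td>
<td width=10% align=center>7</td>
</tr>
</table>
</td></tr>
HTML

# Split the page into the two advice sections, as the script does.
my ($in_addition) = $sample =~ m!in addition to(.*?)(instead of)?</td></tr>!is;
my ($instead)     = $sample =~ m!recommendations instead of(.*?)</table>!is;

# Hypothetical helper: return one hash per product row found in a section.
sub parse_rows {
    my ($section) = @_;
    my @rows;
    return @rows unless defined $section;
    while ($section =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!gis) {
        my ($place, $asin, $title, $number) = ($1||'', $2||'', $3||'', $4||'');
        $title =~ s/($unescape_re)/$unescape{$1}/g;   # unescape HTML
        push @rows, { place => $place, asin => $asin,
                      title => $title, number => $number };
    }
    return @rows;
}

# Print both sections in the same layout the script uses.
for my $pair (['In Addition To', $in_addition], ['Instead Of', $instead]) {
    my ($label, $section) = @$pair;
    print "-- $label --\n\n";
    for my $row (parse_rows($section)) {
        print "$row->{place} $row->{title} ($row->{asin})\n";
        print "(Recommendations: $row->{number})\n\n";
    }
}
```

If a section prints no rows against a saved copy of a real page, the markup has probably changed and the patterns need adjusting. Returning rows as hashes rather than printing inside the loop is a design choice of this sketch, made so that both sections can share one matching routine.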

© 2007 O'Reilly Media, Inc.