The Code
Save the following code as goonow.pl. Be sure to
replace insert key here with your Google
API key along the way.
#!/usr/local/bin/perl -w
# goonow.pl
# Feeds queries specified in a text file to Google, querying
# for recent additions to the Google index. The script appends
# to CSV files, one per query, creating them if they don't exist.
# usage: perl goonow.pl [query_filename]

# My Google API developer's key.
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wsdl = "./GoogleSearch.wsdl";

use strict;
use SOAP::Lite;
use Time::JulianDay;

$ARGV[0] or die "usage: perl goonow.pl [query_filename]\n";

# Julian date of the day before yesterday.
my $julian_date = int local_julian_day(time) - 2;

my $google_search = SOAP::Lite->service("file:$google_wsdl");

open QUERIES, $ARGV[0] or die "Couldn't read $ARGV[0]: $!";
while (my $query = <QUERIES>) {
  chomp $query;
  warn "Searching Google for $query\n";
  $query .= " daterange:$julian_date-$julian_date";
  (my $outfile = $query) =~ s/\W/_/g;
  open (OUT, ">> $outfile.csv")
    or die "Couldn't open $outfile.csv: $!\n";
  my $results = $google_search ->
    doGoogleSearch(
      $google_key, $query, 0, 10, "false", "", "false",
      "", "latin1", "latin1"
    );
  foreach (@{$results->{'resultElements'}}) {
    print OUT '"' . join('","', (
      map {
        s!\n!!g; # drop spurious newlines
        s!<.+?>!!g; # drop all HTML tags
        s!"!""!g; # double escape " marks
        $_;
      } @$_{'title','URL','snippet'}
    ) ) . "\"\n";
  }
}
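To see what the CSV-formatting step at the end of the loop produces without calling Google, it can be pulled out into a small function. What follows is only a sketch: csv_row is a hypothetical helper name, the %result hash is made-up sample data, and unlike the map block in goonow.pl it works on copies rather than modifying the result fields in place.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of goonow.pl): turn a list of fields
# into one CSV line using the same three substitutions as the script.
sub csv_row {
    my @fields = @_;               # copies; the caller's data is untouched
    for (@fields) {
        $_ = '' unless defined $_; # treat missing fields as empty
        s/\n//g;                   # drop spurious newlines
        s/<.+?>//g;                # drop all HTML tags
        s/"/""/g;                  # double embedded " marks
    }
    return '"' . join('","', @fields) . "\"\n";
}

# Made-up sample standing in for one element of resultElements.
my %result = (
    title   => '<b>Google</b> Hacks',
    URL     => 'http://www.example.com/',
    snippet => "Tips and \"tools\"\nfor searching",
);

print csv_row(@result{qw(title URL snippet)});
```

Because newlines are deleted rather than replaced with a space, words on either side of a line break in a snippet run together; swapping the first substitution for s/\n/ /g would avoid that if it matters for your data.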
You'll notice that GooNow checks the additions from the day before
yesterday rather than yesterday's
(my $julian_date = int local_julian_day(time) - 2;). Google reindexes
some pages very frequently; these show up in yesterday's additions and
really bulk up your search results. So if you search for yesterday's
results, you'll get a lot of noise along with the updated pages: pages
that Google indexes every day, rather than the fresh content that
you're after. Skipping back one more day is a nice hack to get around
the noise.
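The date arithmetic can be tried on its own, without Time::JulianDay installed. The sketch below is an approximation built only on core modules, not the module's own code: local_jdn and daterange_query are hypothetical helper names, and the constant 2440588 is the Julian Day Number of January 1, 1970.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Julian Day Number of the Unix epoch date, 1970-01-01.
use constant JDN_UNIX_EPOCH => 2440588;

# Hypothetical helper: Julian Day Number of the local calendar date
# that contains the given epoch time.
sub local_jdn {
    my ($epoch) = @_;
    my ($mday, $mon, $year) = (localtime $epoch)[3, 4, 5];
    # Reinterpret the local calendar date as midnight UTC, so the
    # division below always lands on a whole number of days.
    my $utc_midnight = timegm(0, 0, 0, $mday, $mon, $year + 1900);
    return int($utc_midnight / 86400) + JDN_UNIX_EPOCH;
}

# Hypothetical helper: append a one-day daterange: restriction that
# looks $days_back days into the past (2 = the day before yesterday).
sub daterange_query {
    my ($query, $days_back, $now) = @_;
    $now = time unless defined $now;
    my $jdn = local_jdn($now) - $days_back;
    return "$query daterange:$jdn-$jdn";
}

print daterange_query('"google hacks"', 2), "\n";
```

Changing the second argument from 2 to 1 gives yesterday's additions instead, which makes it easy to compare how much extra noise the more recent day brings in.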