The Code
Save the code as a CGI script ["How to Run the Hacks" in the Preface] named gootop.cgi:
#!/usr/local/bin/perl
# gootop.cgi
# Separates out top-level and sub-level results.
# gootop.cgi is called as a CGI with form input.

# Your Google API developer's key.
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of times to loop, retrieving 10 results at a time.
my $loops = 10;

use strict;
use SOAP::Lite;
use CGI qw/:standard *table/;

print
  header( ),
  start_html("GooTop"),
  h1("GooTop"),
  start_form(-method=>'GET'),
  'Query: ', textfield(-name=>'query'),
  ' ',
  submit(-name=>'submit', -value=>'Search'),
  end_form( ), p( );

my $google_search = SOAP::Lite->service("file:$google_wdsl");

if (param('query')) {
  my $list = { 'toplevel' => [], 'sublevel' => [] };

  # Issue exactly $loops queries, 10 results apiece.
  for (my $offset = 0; $offset < $loops*10; $offset += 10) {
    my $results = $google_search ->
      doGoogleSearch(
        $google_key, param('query'), $offset,
        10, "false", "", "false", "", "latin1", "latin1"
      );

    # File each result under 'toplevel' or 'sublevel' according to its URL.
    foreach (@{$results->{'resultElements'}}) {
      push @{
        $list->{ $_->{URL} =~ m!://[^/]+/?$!
          ? 'toplevel' : 'sublevel' }
      },
      p(
        b($_->{title}||'no title'), br( ),
        a({href=>$_->{URL}}, $_->{URL}), br( ),
        i($_->{snippet}||'no snippet')
      );
    }
  }

  print
    h2('Top-Level Results'),
    join("\n", @{$list->{toplevel}}),
    h2('Sub-Level Results'),
    join("\n", @{$list->{sublevel}});
}

print end_html;

Gleaning a decent number of top-level results means throwing
out quite a bit. For this reason, the script runs the query
several times, as set by my $loops = 10;, with each loop
picking up 10 results, only some of which are top-level.
To alter the number of loops per query, simply change the
value of $loops. Realize that each invocation of the script
burns through $loops queries, so be sparing and don't bump
that number up to anything ridiculous; even 100 will eat
through a daily allotment in just 10 invocations.
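
The arithmetic is easy to check for other values of $loops. The following is only a rough sketch: the 1,000-query daily allotment is an assumption inferred from the figures above (100 loops times 10 invocations), and the helper name invocations_per_day is made up for illustration rather than taken from gootop.cgi.

```perl
use strict;
use warnings;

# Assumed daily query allotment, inferred from the text's arithmetic;
# substitute whatever limit actually applies to your own key.
my $daily_allotment = 1000;

# Hypothetical helper: how many complete runs of the script fit in a
# day, given that each run issues one query per loop.
sub invocations_per_day {
    my ($loops, $allotment) = @_;
    return int($allotment / $loops);
}

foreach my $loops (10, 25, 100) {
    printf "loops=%-3d -> %d invocations per day\n",
        $loops, invocations_per_day($loops, $daily_allotment);
}
```

Under that assumption, the default of 10 loops leaves room for 100 searches a day.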
The heart of the script, and what differentiates it from your average
Google API Perl script, lies in the following code:
push @{
  $list->{ $_->{URL} =~ m!://[^/]+/?$!
    ? 'toplevel' : 'sublevel' }
}
What that jumble of characters is scanning for is
:// (as in http://) followed by one or more characters
other than a / (slash), running to the end of the URL, thereby sifting
between top-level finds (e.g., http://www.berkeley.edu) and
sublevel results (e.g., http://www.berkeley.edu/students/john_doe/my_dog.html).
If you're Perl savvy, you may have noticed the
trailing /?$; this allows for the eventuality that
a top-level URL ends with a slash (e.g., http://www.berkeley.edu/), as is often true.
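
To watch the pattern at work outside the CGI script, a minimal sketch along these lines may help. The helper name classify_url and the sample URLs are illustrative assumptions; only the regular expression itself comes from gootop.cgi.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of gootop.cgi): applies the same
# pattern the script uses to sort a URL into one of its two buckets.
sub classify_url {
    my ($url) = @_;
    return $url =~ m!://[^/]+/?$! ? 'toplevel' : 'sublevel';
}

# Sample URLs chosen for illustration only.
foreach my $url (
    'http://www.berkeley.edu',
    'http://www.berkeley.edu/',
    'http://www.berkeley.edu/welcome.html',
    'http://www.berkeley.edu/students/john_doe/my_dog.html',
) {
    printf "%-9s %s\n", classify_url($url), $url;
}
```

The first two URLs land in the toplevel bucket and the last two in sublevel: any path beyond the optional trailing slash is enough to push a result out of the top-level list.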