O'Reilly Hacks



HACK
#30
Restrict Searches to Top-Level Results
Separate out search results by the depth at which they appear in a site

The Code

Save the code as a CGI script ["How to Run the Hacks" in the Preface] named gootop.cgi:

#!/usr/local/bin/perl
# gootop.cgi
# Separates out top-level and sub-level results.
# gootop.cgi is called as a CGI with form input.
     
# Your Google API developer's key.
my $google_key='insert key here';
     
# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";
     
# Number of times to loop, retrieving 10 results at a time.
my $loops = 10;
     
use strict;
     
use SOAP::Lite;
use CGI qw/:standard *table/;
     
print
  header( ),
  start_html("GooTop"),
  h1("GooTop"),
  start_form(-method=>'GET'),
  'Query: ', textfield(-name=>'query'),
  '   ',
  submit(-name=>'submit', -value=>'Search'),
  end_form( ), p( );
     
my $google_search  = SOAP::Lite->service("file:$google_wdsl");
     
if (param('query')) {
  my $list = { 'toplevel' => [], 'sublevel' => [] };
     
  for (my $offset = 0; $offset < $loops*10; $offset += 10) {
    my $results = $google_search ->
      doGoogleSearch(
        $google_key, param('query'), $offset,
        10, "false", "",  "false", "", "latin1", "latin1"
      );
     
    foreach (@{$results->{'resultElements'}}) {
      push @{
        $list->{ $_->{URL} =~ m!://[^/]+/?$!
        ? 'toplevel' : 'sublevel' }
      },
      p(
        b($_->{title}||'no title'), br( ),
        a({href=>$_->{URL}}, $_->{URL}), br( ),
        i($_->{snippet}||'no snippet')
      );
    }
  }
     
  print
    h2('Top-Level Results'),
    join("\n", @{$list->{toplevel}}),
    h2('Sub-Level Results'),
    join("\n", @{$list->{sublevel}});
}
     
print end_html;

Gleaning a decent number of top-level results means throwing out quite a bit. For this reason, the script runs the specified query several times, as set by my $loops = 10;. Each loop picks up 10 results, only some of which are top-level. To alter the number of loops per query, simply change the value of $loops. Keep in mind that each invocation of the script burns through $loops queries, so be sparing and don't bump that number up to anything ridiculous; even 100 will eat through a daily allotment in just 10 invocations.
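The trade-off between $loops and the daily allotment is simple arithmetic, and a rough sketch may make it easier to pick a value. The snippet below assumes a daily allotment of 1,000 queries (the figure implied by "100 loops, 10 invocations") and assumes each invocation issues exactly $loops queries; the searches_per_day helper is a hypothetical name used only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed daily query allotment; the text implies 1,000
# (100 loops x 10 invocations). Check the limit on your own key.
my $daily_quota = 1000;

# Hypothetical helper: if each invocation issues one query per loop,
# a day's quota supports quota / loops searches, rounded down.
sub searches_per_day {
    my ($quota, $loops) = @_;
    return int($quota / $loops);
}

# Print the budget for a few candidate values of $loops.
for my $loops (10, 25, 100) {
    printf "\$loops = %3d: %3d searches per day\n",
        $loops, searches_per_day($daily_quota, $loops);
}
```

Under these assumptions, the default of 10 loops leaves room for about 100 searches a day.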

The heart of the script, and what differentiates it from your average Google API Perl script, lies in the code that follows.

push @{
  $list->{ $_->{URL} =~ m!://[^/]+/?$!
  ? 'toplevel' : 'sublevel' }
}

What that jumble of characters is scanning for is :// (as in http://) followed by one or more characters other than a / (slash), an optional trailing slash, and then the end of the URL. That sifts top-level finds (e.g., http://www.berkeley.edu) from sublevel results (e.g., http://www.berkeley.edu/students/john_doe/my_dog.html). If you're Perl savvy, you may have noticed the trailing /?$; this allows for the eventuality that a top-level URL ends with a slash (e.g., http://www.berkeley.edu/), as is often true.
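To see the pattern at work outside the CGI script, here is a small standalone sketch that sorts a few sample URLs into the same two buckets. The classify helper and the URL list are illustrative assumptions, not part of gootop.cgi. Note that under this pattern any URL with a path after the host, such as http://www.berkeley.edu/welcome.html, lands in the sublevel bucket.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from gootop.cgi:
# "://", a host containing no slashes, an optional trailing
# slash, and then the end of the string.
sub classify {
    my ($url) = @_;
    return $url =~ m!://[^/]+/?$! ? 'toplevel' : 'sublevel';
}

# Sample URLs chosen for illustration only.
my @urls = (
    'http://www.berkeley.edu',
    'http://www.berkeley.edu/',
    'http://www.berkeley.edu/welcome.html',
    'http://www.berkeley.edu/students/john_doe/my_dog.html',
);

# Same bucket structure as the script: a hash of two array references.
my $list = { 'toplevel' => [], 'sublevel' => [] };
push @{ $list->{ classify($_) } }, $_ for @urls;

# Print each bucket in turn.
for my $level ('toplevel', 'sublevel') {
    print "$level:\n";
    print "  $_\n" for @{ $list->{$level} };
}
```

The first two URLs end up under toplevel and the last two under sublevel.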



© 2007 O'Reilly Media, Inc.