O'Reilly Hacks



HACK
#42
Compare Google's Results with Other Search Engines
Compare Google search results with results from other search engines

The Code

Save the following code as a CGI script ["How to Run the Hacks" in the Preface] named google_compare.cgi in your web site's cgi-bin directory:

#!/usr/local/bin/perl
# google_compare.cgi
# Compares Google results against those of other search engines.
     
# Your Google API developer's key.
my $google_key='insert key here';
     
# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";
     
use strict;
     
use SOAP::Lite;
use LWP::Simple qw(get);
use CGI qw{:standard};
     
my $googleSearch = SOAP::Lite->service("file:$google_wdsl");
     
# Set up our browser output.
print "Content-type: text/html\n\n";
print "<html><title>Google Compare Results</title><body>\n";
     
# Ask and we shall receive.
my $query = param('query');
unless ($query) {
   print "<h1>No query defined.</h1></body></html>\n\n";
   exit; # If there's no query there's no program. 
}
     
# Spit out the original before we encode.
print "<h1>Your original query was '" . CGI::escapeHTML($query) . "'.</h1>\n";
     
$query =~ s/\s/\+/g ;  #changing the spaces to + signs
$query =~ s/\"/%22/g;  #changing the quotes to %22
     
# Create some hashes of queries for various search engines.  
# We have four types of queries ("plain", "com", "org", and "net"), 
# and three search engines ("Google", "AlltheWeb", and "Altavista"). 
# Each engine has a name, query, and regular expression used to 
# scrape the results.
my $query_hash = { 
   plain => {
      Google => { name => "Google", query => $query, },
      AlltheWeb => {
         name   => "AlltheWeb",
         regexp => '<span class="ofSoMany">(.*)</span>',
         query  => "http://www.alltheweb.com/search?cat=web&q=$query",
      },
      Altavista => {
         name  => "Altavista", 
         regexp => 'AltaVista found (.*) results',
         query => "http://www.altavista.com/sites/search/web?q=$query",
      }
   },
   com => {
      Google => { name => "Google", query => "$query site:com", },
      AlltheWeb => {
         name   => "AlltheWeb",
         regexp => '<span class="ofSoMany">(.*)</span>',
         query  => "http://www.alltheweb.com/search?cat=web&q=$query+domain%3Acom",
      },
      Altavista => {
         name  => "Altavista", 
         regexp => 'AltaVista found (.*) results',
         query => "http://www.altavista.com/sites/search/web?q=$query+domain%3Acom",
      }
   },
   org => {
      Google => { name => "Google", query => "$query site:org", },
      AlltheWeb => {
         name   => "AlltheWeb",
         regexp => '<span class="ofSoMany">(.*)</span>',
         query  => "http://www.alltheweb.com/search?cat=web&q=$query+domain%3Aorg",
      },
      Altavista => {
         name  => "Altavista", 
         regexp => 'AltaVista found (.*) results',
         query => "http://www.altavista.com/sites/search/web?q=$query+domain%3Aorg",
      }
   },
   net => {
      Google => { name => "Google", query => "$query site:net", },
      AlltheWeb => {
         name   => "AlltheWeb",
         regexp => '<span class="ofSoMany">(.*)</span>',
         query  => "http://www.alltheweb.com/search?cat=web&q=$query+domain%3Anet",
      },
      Altavista => {
         name  => "Altavista", 
         regexp => 'AltaVista found (.*) results',
         query => "http://www.altavista.com/sites/search/web?q=$query+domain%3Anet",
      }
   }
};
     
# Now we loop through each of our query types
# under the assumption there's a matching
# hash that contains our engines and string.
foreach my $query_type (keys (%$query_hash)) {
   print "<h2>Results for a '$query_type' search:</h2>\n";
     
   # Now, loop through each engine we have and get/print the results.
   foreach my $engine (values %{$query_hash->{$query_type}}) {
      my $results_count; 
     
      # If this is Google, we use the API and not port 80.
      if ($engine->{name} eq "Google") {
         my $result = $googleSearch->doGoogleSearch(
             $google_key, $engine->{query}, 0, 1,
             "false", "", "false", "", "latin1", "latin1");
         $results_count = $result->{estimatedTotalResultsCount};
         # The Google API doesn't format numbers with commas.
         my $rresults_count = reverse $results_count;
         $rresults_count =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
         $results_count = scalar reverse $rresults_count;
      }
     
      # It's not Google, so we GET like everyone else.
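      else {
         my $data = get($engine->{query});
         print "ERROR: could not fetch $engine->{query}<br />\n"
            unless defined $data;
         # Only trust $1 when the fetch and the match both succeeded.
         $results_count =
            (defined $data && $data =~ /$engine->{regexp}/) ? $1 : 0;
      }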
     
      # and print out the results.
      print "<strong>$engine->{name}</strong>: $results_count<br />\n";
   }
}

# Close the page we opened above.
print "</body></html>\n";
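
The script prepares the query for use in a URL by rewriting only whitespace and double quotes. As a rough sketch of a more general approach, the hypothetical helper below (encode_query is not part of the hack) percent-encodes every character outside a small safe set and then turns spaces into + signs. It assumes plain ASCII input; a production script would more likely use a module such as URI::Escape.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original hack: percent-encode
# anything outside a conservative safe set, then map spaces to "+".
# Assumes ASCII input.
sub encode_query {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~ ])/sprintf("%%%02X", ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

print encode_query('"perl hacks" & more'), "\n";
```

Unlike the two substitutions in the script, this version also escapes characters such as & and +, which would otherwise change the meaning of the query string.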

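The comma-formatting step in the Google branch can be tried on its own. The sketch below wraps the same reverse-and-substitute idea in a hypothetical sub named commify (the name is an assumption; the script does this inline).

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the script's inline technique:
# reverse the number, add a comma after each group of three digits
# that is followed by another digit, then reverse it back.
sub commify {
    my ($number) = @_;
    my $reversed = reverse $number;
    # The (?!\d*\.) check skips digits that belong to a fractional part.
    $reversed =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
    return scalar reverse $reversed;
}

print commify(1234567), "\n";   # prints 1,234,567
print commify(999), "\n";       # prints 999
```

Reversing first lets the substitution count digits from the right-hand end of the number, which is where the groups of three start.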

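For the engines that are scraped rather than queried through an API, the count comes from a single capture group in each engine's regexp. Here is a minimal sketch of that step as a hypothetical sub, extract_count, that returns 0 when the page could not be fetched or the pattern does not match. The sample markup is made up for illustration and is not a real response from any search engine.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a pattern with one capture group to a page
# and return the captured text, or 0 if there is no page or no match.
sub extract_count {
    my ($html, $pattern) = @_;
    return 0 unless defined $html;
    return $html =~ /$pattern/ ? $1 : 0;
}

# Made-up sample markup for illustration only.
my $sample = '<p>AltaVista found 1,234 results</p>';

print extract_count($sample, 'AltaVista found (.*) results'), "\n";  # prints 1,234
print extract_count(undef,   'AltaVista found (.*) results'), "\n";  # prints 0
```

Reading $1 only after a successful match avoids picking up a stale capture from an earlier regular expression.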

© 2007 O'Reilly Media, Inc.