O'Reilly Hacks
oreilly.comO'Reilly NetworkSafari BookshelfConferences Sign In/My Account | View Cart   
Book List Learning Lab PDFs O'Reilly Gear Newsletters Press Room Jobs  


 
Buy the book!
Spidering Hacks
By Kevin Hemenway, Tara Calishain
October 2003
More Info

HACK
#85
Aggregating Multiple Search Engine Results
Even though Google may solve all your searching needs on a daily basis, there may come a time when you need a "super search"—something that queries multiple search engines or databases at once
The Code
[Discuss (0) | Link to this hack]

The Code

The following short piece of code demonstrates the server portion, which searches for a ./plugins directory and executes all the code within:

#!/usr/bin/perl -w

# aggsearch - aggregate searching engine
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

######################
# support stage      #
######################

use strict;

# change this, if neccessary.
my $pluginDir = "plugins";

# if the user didn't enter any search terms, yell at 'em.
unless (@ARGV) { print 'usage: aggsearch "search terms"', "\n"; exit; }

# this routine actually executes the current
# plug-in, receives the tabbed data, and sticks
# it into a result array for future printing.
sub query {
    my ($plugin, $args, @results) = (shift, shift);
    my $command = $pluginDir . "/" . $plugin . " " . (join " ", @$args);
    open RESULTS, "$command |" or die "Plugin $plugin failed!\n";
    while (<RESULTS>) {
        chomp; # remove new line.
        my ($url, $name) = split /\t/;
        push @results, [$name, $url];
    } close RESULTS;

    return @results;
}

######################
# find plug-ins stage #
######################

opendir PLUGINS, $pluginDir
   or die "Plugin directory \"$pluginDir\"".
     "not found! Please create, and populate\n";
my @plugins = grep {
    stat $pluginDir . "/$_"; -x _ && ! -d _ && ! /\~$/;
} readdir PLUGINS; closedir PLUGINS;


######################
# query stage        #
######################

for my $plugin (@plugins){
    print "$plugin results:\n";
    my @results = query $plugin, \@ARGV;
    for my $listref (@results){
        print " $listref->[0] : $listref->[1] \n"
    } print "\n";
}

exit 0;

The plug-ins themselves are even smaller than the server code, since their only purpose is to return a tab-delimited set of results. Our first sample looks through the freshmeat.net (http://freshmeat.net) software site:

#!/usr/bin/perl -w

# Example freshmeat searching plug-in
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

use strict;
use LWP::UserAgent;
use HTML::TokeParser;

# create the URL from our incoming query.
my $url = "http://freshmeat.net/search-xml?q=" . join "+", @ARGV;

# download the data.
my $ua = LWP::UserAgent->new(  );
$ua->agent('Mozilla/5.0');
my $response = $ua->get($url);
die $response->status_line . "\n"
  unless $response->is_success;

my $stream = HTML::TokeParser->new (\$response->content) or die "\n";
while (my $tag = $stream->get_tag("match")){
    $tag = $stream->get_tag("projectname_full");
    my $name = $stream->get_trimmed_text("/projectname_full");
    $tag = $stream->get_tag("url_homepage");
    my $url = $stream->get_trimmed_text("/url_homepage");
    print "$url\t$name\n";
}

Our second sample uses the Google API:

#!/usr/bin/perl -w

# Example Google searching plug-in

use strict;
use warnings;
use SOAP::Lite;

# all the Google information
my $google_key  = "your API key here";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");
my $query       = join "+", @ARGV;

# do the search...
my $result = $gsrch->doGoogleSearch($google_key, $query,
                          1, 10, "false", "",  "false",
                          "lang_en", "", "");

# and print the results.
foreach my $hit (@{$result->{'resultElements'}}){
   print "$hit->{URL}\t$hit->Aggregating Multiple Search Engine Results\n";
}

Our last example covers AlltheWeb.com:

#!/usr/bin/perl -w

# Example alltheweb searching plug-in
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

use strict;
use LWP::UserAgent;
use HTML::TokeParser;

# create the URL from our incoming query.
my $url = "http://www.alltheweb.com/search?cat=web&cs=iso-8859-1" .
          "&q=" . (join "+", @ARGV) . "&_sb_lang=en";

print $url;
# download the data.
my $ua = LWP::UserAgent->new(  );
$ua->agent('Mozilla/5.0');
my $response = $ua->get($url);
die $response->status_line . "\n"
  unless $response->is_success;

my $stream = HTML::TokeParser->new (\$response->content) or die "\n";
while (my $tag = $stream->get_tag("p")){
    $tag = $stream->get_tag("a");
    my $name = $stream->get_trimmed_text("/a");
    last if $name eq "last 10 queries";
    my $url = $tag->[1]{href};
    print "$url\t$name\n";
}


O'Reilly Home | Privacy Policy

© 2007 O'Reilly Media, Inc.
Website: | Customer Service: | Book issues:

All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.