O'Reilly Hacks



HACK #45
Glean Weblog-Free Google Results
With so many weblogs being indexed by Google, you might worry about too much emphasis on the hot topic of the moment. In this hack, we'll show you how to remove the weblog factor from your Google results.

The Code

Save the following code ["How to Run the Hacks" in the Preface] to a file called googletech.cgi.

TIP

You'll need the XML::Simple and SOAP::Lite Perl modules to run this hack.

#!/usr/bin/perl -w
# googletech.cgi
# Getting Google results
# without getting weblog results.
use strict;
use SOAP::Lite;
use XML::Simple;
use CGI qw(:standard *ul);   # *ul requests start_ul and end_ul
use HTML::Entities ( );
use LWP::Simple qw(!head);
     
my $technoratikey = "insert technorati key here";
my $googlekey = "insert google key here";
     
# Set up the query term
# from the CGI input.
my $query = param("q");
     
# Initialize the SOAP interface and run the Google search.
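my $google_wsdl = "http://api.google.com/GoogleSearch.wsdl";
my $service = SOAP::Lite->service($google_wsdl);
     
# Ask for 10 results starting at offset 0. Arguments, in order:
# key, query, start, maxResults, filter, restrict, safeSearch, lr, ie, oe.
my $result = $service->doGoogleSearch(
    $googlekey, $query, 0, 10, "false", "", "false", "", "", ""
);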
     
# Start returning the results page;
# do this now to prevent timeouts.
my $cgi = new CGI;
     
print $cgi->header( );
print $cgi->start_html(-title=>'Blog Free Google Results');
# Encode the query before echoing it back into the page.
print $cgi->h1('Blog Free Results for ' . HTML::Entities::encode($query));
print $cgi->start_ul( );
     
# Go through each of the results.
foreach my $element (@{$result->{'resultElements'}}) {
     
    my $url = HTML::Entities::encode($element->{'URL'});
     
    # Request the Technorati information for each result.
    my $technorati_result = get("http://api.technorati.com/bloginfo?".
                                "url=$url&key=$technoratikey");
     
    # Parse this information.
    my $parser = new XML::Simple;
    my $parsed_feed = $parser->XMLin($technorati_result);
     
    # If Technorati considers this site to be a weblog,
    # go onto the next result. If not, display it, and then go on.
    if ($parsed_feed->{document}{result}{weblog}{name}) { next; }
    else {
        # Title link and snippet (when available) share one list item.
        print $cgi->li('<a href="'.$url.'">'.$element->{title}.'</a><br />'.
                       ($element->{snippet} || ''));
    }
}
print $cgi -> end_ul( );
print $cgi->end_html;

Let's step through the meaningful bits of this code. First, the script reads the query from the CGI input and passes it to Google. Notice the 10 in the doGoogleSearch call; this is the number of search results requested from Google. Set it as high as Google will allow; otherwise, a search for terms that are extremely popular in the weblogging world might return no results at all, every one having been rejected as originating from a blog.
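Another way to give the weblog filter more candidates is to make several calls with an increasing start offset. What follows is a minimal sketch rather than part of the original hack: collect_results and its $fetch_page callback are hypothetical names, and the callback stands in for a live call such as $service->doGoogleSearch with the offset passed as the start argument. A fake search is used here so the sketch runs offline.

```perl
use strict;
use warnings;

# Hypothetical helper: gather result elements from several pages.
# $fetch_page is a code ref that takes a start offset and returns a
# hash ref shaped like a doGoogleSearch response, that is, one with
# a 'resultElements' array ref.
sub collect_results {
    my ($fetch_page, $pages, $page_size) = @_;
    my @elements;
    for my $page (0 .. $pages - 1) {
        my $result = $fetch_page->($page * $page_size);
        my $batch  = ($result && $result->{resultElements}) || [];
        last unless @$batch;    # stop when a page comes back empty
        push @elements, @$batch;
    }
    return \@elements;
}

# Stand-in for the live search (an assumption for illustration):
# pretends the query has 25 results in total, numbered 0 to 24.
my $fake_search = sub {
    my ($start) = @_;
    my $end = $start + 9 > 24 ? 24 : $start + 9;
    return { resultElements =>
        [ map { +{ URL => "http://example.com/$_" } } $start .. $end ] };
};

my $all = collect_results($fake_search, 3, 10);
print scalar(@$all), " results collected\n";
```

Keep in mind that every extra result also costs one Technorati lookup, so fetching more pages uses up the daily Technorati allowance faster.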

Since we're about to make a web services call for every one of the returned results, which might take a while, we want to start returning the results page now; this helps prevent connection timeouts. As such, we spit out a header using the CGI module, and then jump into our loop.
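Printing the header early helps only if the output actually leaves the script. Perl normally buffers STDOUT when it is not attached to a terminal, so one option is to turn on autoflush for the handle first. The sketch below illustrates the idea under that assumption; start_page is a hypothetical helper that writes the header by hand instead of going through the CGI module, and the web server may still do buffering of its own.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical helper: send the HTTP header and the top of the page
# to $fh right away, with buffering turned off for that handle.
sub start_page {
    my ($fh, $title) = @_;
    $fh->autoflush(1);    # flush after every print to this handle
    print {$fh} "Content-Type: text/html\n\n";
    print {$fh} "<html><head><title>$title</title></head><body>\n";
    print {$fh} "<h1>$title</h1>\n<ul>\n";
    return;
}

start_page(\*STDOUT, 'Blog Free Google Results');
```

Inside the hack itself, setting $| = 1 near the top of the script has the same effect for STDOUT.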

We then get to the final part of our code: looping through the search results returned by Google and passing each HTML-encoded URL to the Technorati API in a GET request. Technorati returns its answer as an XML document.
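The hack HTML-encodes the URL before placing it in the query string. Result URLs that contain characters such as & or ? may survive the trip better if they are percent-encoded instead. Here is a minimal sketch of that alternative, using only core Perl; percent_encode and bloginfo_url are hypothetical helper names, and the endpoint and parameter names are the ones used in the listing above.

```perl
use strict;
use warnings;

# Percent-encode everything except the RFC 3986 unreserved characters.
sub percent_encode {
    my ($text) = @_;
    utf8::encode($text);    # work on UTF-8 bytes, not characters
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Hypothetical helper: build the bloginfo request URL for one result.
sub bloginfo_url {
    my ($site_url, $key) = @_;
    return 'http://api.technorati.com/bloginfo?url='
         . percent_encode($site_url)
         . '&key=' . percent_encode($key);
}

print bloginfo_url('http://example.com/a?b=1&c=2', 'KEY'), "\n";
```

If the URI::Escape module is installed, its uri_escape function does the same job as percent_encode here.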

TIP

Be careful that you do not run out of Technorati requests. As I write this, Technorati is offering 500 free requests a day, which, at 10 lookups per search, is around 50 searches with this script. If you make this script available to your web site audience, you will soon run out of Technorati requests. One possible workaround is requiring each user to enter her own Technorati key. You can get the user's key from the same form that accepts the query. See the "Hacking the Hack" section for a means of doing this.

Parsing this result is a matter of passing it through XML::Simple. Since Technorati returns an XML construct containing name only when the site is thought to be a weblog, we can use the presence of this construct as a marker. If the program sees the construct, it skips to the next result. If it doesn't, Technorati does not consider the site a weblog, and we display a link to it, along with the title and snippet (when available) returned by Google.
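The marker test can also be written so that it does not fail under use strict when a level of the parsed structure is missing or is not a hash. The following is a sketch only: is_weblog is a hypothetical helper, and the sample hashes are hand-built stand-ins for what XML::Simple might return, not actual Technorati output.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the parsed response contains
# document -> result -> weblog -> name, the marker the hack relies on.
sub is_weblog {
    my ($parsed) = @_;
    my $node = $parsed;
    for my $key (qw(document result weblog)) {
        return 0 unless ref $node eq 'HASH' && exists $node->{$key};
        $node = $node->{$key};
    }
    return 0 unless ref $node eq 'HASH' && defined $node->{name};
    return length $node->{name} ? 1 : 0;
}

# Hand-built sample structures (assumptions for illustration).
my $blog     = { document => { result => { weblog => { name => 'Example Blog' } } } };
my $not_blog = { document => { result => { url => 'http://example.com' } } };
my $empty    = { document => { result => { weblog => '' } } };

for my $case ([blog => $blog], [plain => $not_blog], [empty => $empty]) {
    printf "%-5s %s\n", $case->[0], is_weblog($case->[1]) ? 'skip' : 'show';
}
```

In the loop, the check would then read next if is_weblog($parsed_feed); with the display code following it.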



© 2007 O'Reilly Media, Inc.