O'Reilly Hacks
Spidering Hacks
By Morbus Iff, Tara Calishain
October 2003

HACK #74
What's Your Visitor's Weather Like?
You have a web site, as most people do, and you'd like a general idea of what your visitors' weather is like. Want to know if you get more comments when it's raining or sunny? With the groundwork laid in this hack, that and other nonsense will be readily available.

When you're spidering, don't consider only data available on the Web. Sometimes, the data is right under your nose, perhaps on your own server or even on your own hard drive. This hack demonstrates the large amount of information available, even when you have only a small amount of your own data to start with. In this case, we're looking at a web server's log file, taking the IP addresses of the last few visitors, using one database to look up the geographical location of each address, and then using another to find the weather there. It's a trivial example, perhaps, but it's also quite nifty. For example, you could easily modify this code to greet visitors to your site with commiserations about the rain.

For the geographical data, we're going to use the Perl interface to the CAIDA project (http://www.caida.org/tools/utilities/netgeo/NGAPI/index.xml); for the weather data, we're using the Weather::Underground module, which utilizes the information at http://www.wunderground.com.

Using and Hacking the Hack

I have this script installed on my weblog using an Apache server-side include. This is probably a bad idea, given the potential for slow responses from CAIDA and Weather Underground, but it does allow for completely up-to-date information. A more sensible approach might be to change the script to produce a static file and run it from cron every few minutes.
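As a rough sketch of the static-file approach, the helper below writes finished HTML to a temporary file and then renames it into place, so a reader never sees a half-written page. The name write_static, the placeholder HTML, and the crontab line in the comment are illustrative assumptions, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: write $html to $target without ever exposing
# a partially written file to the web server.
sub write_static {
    my ($target, $html) = @_;

    # Create the temporary file in the target's directory so that
    # the final rename stays on one filesystem.
    my ($fh, $tmp) = tempfile(DIR => dirname($target), UNLINK => 0);
    print {$fh} $html or die "write failed: $!";
    close $fh         or die "close failed: $!";

    # tempfile creates files readable only by their owner;
    # loosen that so the web server can read the result.
    chmod 0644, $tmp    or die "chmod failed: $!";
    rename $tmp, $target or die "rename failed: $!";
    return $target;
}

# Placeholder output; in practice this would be the list the hack prints.
my $html = "<h2>Where my last few visitors came from:</h2>\n<ul>\n</ul>\n";

# Does nothing unless a target path is given on the command line.
# An assumed crontab entry to refresh the file every ten minutes:
#   */10 * * * * /path/to/this_script.pl /path/to/weather.html
write_static($ARGV[0], $html) if @ARGV;
```

The page that used to call the script through a server-side include would then include the generated file instead.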

If you're sure of fast responses, and if you have a dynamically created page, it would be fun to customize that page based on the weather at the reader's location. Pithy comments about the rain are always appreciated. Tweaking the Weather Underground response to give you the temperature instead of a descriptive string creates the possibility of dynamically selecting CSS stylesheets, so that colors change based on the temperature. Storing the weather data over a period of time gives you the possibility of creating an "average readership temperature" or the amount of rain that has fallen on your audience this week. These would be fun statistics for some and perhaps extremely useful for others.
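To make the stylesheet and statistics ideas concrete, here is a minimal sketch. It assumes you already have temperatures in degrees Celsius from somewhere; the thresholds, the stylesheet names, and both function names are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Pick a stylesheet from a temperature in degrees Celsius.
# Thresholds and file names are hypothetical.
sub stylesheet_for {
    my ($temp_c) = @_;
    return 'default.css' unless defined $temp_c;   # lookup failed
    return 'cold.css'    if $temp_c < 5;
    return 'mild.css'    if $temp_c < 20;
    return 'hot.css';
}

# Average of the temperatures we managed to collect,
# ignoring failed lookups (undef). Returns undef if none remain.
sub average_temperature {
    my @temps = grep { defined } @_;
    return unless @temps;
    return sum(@temps) / @temps;
}

my @readings = (12, undef, 27, -3);
print "Stylesheet for the latest visitor: ", stylesheet_for($readings[-1]), "\n";
printf "Average readership temperature: %.1f C\n", average_temperature(@readings);
```

Storing each reading as it arrives, in a flat file or a database, is what would turn the average into a weekly statistic.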

The code loads up the access_log, reverses it to put the last accesses at the top, and then goes through the resulting list, line by line. First, it runs the line through a regular expression:

my ($domain,$rfc931,$authuser,$TimeDate,$Request,$Status,$Bytes,$Referrer,$Agent) =
    $line =~ /^(\S+) (\S+) (\S+) \[([^\]\[]+)\] \"([^"]*)\" (\S+) (\S+) \"?([^"]*)\"? \"([^"]*)\"/o;

This splits the line into its different sections and is based on Apache's combined log format. We'll be using only the first variable (the domain itself) from these results, but, because this regular expression is so useful, I include it for your cannibalistic pleasure.
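If you do want to cannibalize the expression, a small wrapper that returns named fields can be easier to reuse than nine positional variables. The sketch below assumes both the referrer and the agent are quoted, as in the standard combined format; the sample log line, the field names, and the function name parse_combined are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample entry in Apache's combined log format.
my $line = '192.0.2.10 - - [10/Oct/2003:13:55:36 -0700] '
         . '"GET /index.html HTTP/1.0" 200 2326 '
         . '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98)"';

# Returns a hash reference of named fields, or undef if the
# line does not look like a combined-format entry.
sub parse_combined {
    my ($entry) = @_;
    my @fields = $entry =~
        /^(\S+) (\S+) (\S+) \[([^\]\[]+)\] "([^"]*)" (\S+) (\S+) "([^"]*)" "([^"]*)"/
        or return;

    my %record;
    @record{qw(host rfc931 authuser timedate request
               status bytes referrer agent)} = @fields;
    return \%record;
}

my $record = parse_combined($line);
print "$record->{host} asked for '$record->{request}'\n" if $record;
```

Returning undef for a non-matching line lets the caller skip it with a simple test rather than carrying undefined fields forward.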

Anyhow, we take the domain and pass it to the CAIDA module, retrieving a result and checking whether that result is useful. If it's not useful, we go to the next line in the access_log. This highlights an important point when using third-party databases: you must always check for a failed query. Indeed, it might even be a good idea to treat a successful query as the exception rather than the rule.
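One way to treat failure as the normal case is to funnel every remote query through a wrapper that reports "nothing usable" unless the result passes explicit checks. The sketch below is self-contained: safe_lookup and the three stand-in services are hypothetical, and the COUNTRY check simply mirrors the test this hack's script applies to its geographical record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Call $lookup (a code reference standing in for any third-party
# query) and return its result only if it is a hash reference with
# a non-empty COUNTRY field. Returns undef in every other case.
sub safe_lookup {
    my ($lookup, @args) = @_;

    my $result;
    my $ok = eval { $result = $lookup->(@args); 1 };
    return unless $ok;                        # the call died
    return unless ref($result) eq 'HASH';     # nothing, or the wrong shape
    return unless $result->{COUNTRY};         # record has no country
    return $result;
}

# Hypothetical services showing the outcomes worth planning for.
my %services = (
    dies    => sub { die "connection timed out\n" },
    empty   => sub { return {} },
    working => sub { return { CITY => 'LONDON', COUNTRY => 'GB' } },
);

for my $name (sort keys %services) {
    my $record = safe_lookup($services{$name}, '192.0.2.10');
    print "$name: ", ($record ? "got $record->{COUNTRY}" : "no usable result"), "\n";
}
```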

Assuming we have a good result, we need to detect whether the country is the U.S. If it is, we set $region to the U.S. state; otherwise, we use the name of the country. The record holds a two-letter country code, so we use the country function from the Geography::Countries module to convert that code to the country's full name.
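The region logic can be tried on its own, without either remote service. In the sketch below, a three-entry hash stands in for Geography::Countries so that the example runs anywhere; the hash, title_case, and place_for are all invented for illustration. It also capitalizes every word, which is a small departure from the script's ucfirst(lc(...)), since that would turn "NEW YORK" into "New york".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the country() lookup: code => name.
my %country_name = (GB => 'United Kingdom', FR => 'France', JP => 'Japan');

# Capitalize the first letter of each word.
sub title_case {
    my ($text) = @_;
    $text = lc(defined $text ? $text : '');
    $text =~ s/(^|\s)(\w)/$1\u$2/g;
    return $text;
}

# Build the "City, Region" string used for the weather query.
# For the U.S. the region is the state; elsewhere it is the country
# name, falling back to the raw code if the name is unknown.
sub place_for {
    my ($record) = @_;
    my $city   = title_case($record->{CITY});
    my $region = $record->{COUNTRY} eq 'US'
               ? title_case($record->{STATE})
               : ($country_name{ $record->{COUNTRY} } || $record->{COUNTRY});
    return "$city, $region";
}

print place_for({ CITY => 'NEW YORK', STATE => 'NEW YORK', COUNTRY => 'US' }), "\n";
print place_for({ CITY => 'LONDON', COUNTRY => 'GB' }), "\n";
```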

—Ben Hammersley

The Code

Copy this code, changing the $apachelogfile line to reflect the path to your Apache installation's access_log. Here, mine is in the same directory as the script:

#!/usr/bin/perl -w
#
# Ben Hammersley ben@benhammersley.com
# Looks up the real-world location of visiting IPs
# and then finds out the weather at those places
#

use strict;
use CAIDA::NetGeoClient;
use Weather::Underground;
use Geography::Countries;

my $apachelogfile = "access_log";
my $numberoflines = 10;
my $lastdomain    = "";

# Open up the logfile.
open (my $log, "<", $apachelogfile)
    or die "Cannot open $apachelogfile: $!";

# Place all the lines of the logfile
# into an array, but in reverse order.
my @lines = reverse <$log>;
close $log;

# Start our HTML document.
print "<h2>Where my last few visitors came from:</h2>\n<ul>\n";

# Go through each line one
# by one, setting the variables.
my $i = 0;
foreach my $line (@lines) {
    my ($domain,$rfc931,$authuser,$TimeDate,
        $Request,$Status,$Bytes,$Referrer,$Agent) =
        $line =~ /^(\S+) (\S+) (\S+) \[([^\]\[]+)\] \"([^"]*)\" (\S+) (\S+) \"?([^"]*)\"? \"([^"]*)\"/o;

    # Skip lines that do not match the combined log format.
    next unless defined $domain;

    # If this record is one we saw
    # the last time around, move on.
    next if ($domain eq $lastdomain);

    # And now get the geographical info.
    my $geo     = CAIDA::NetGeoClient->new(  );
    my $record  = $geo->getRecord($domain);

    # Check to see if there is a record returned at all,
    # before reading any of its fields.
    unless ($record && $record->{COUNTRY}) { $lastdomain = $domain; next; }

    my $city    = ucfirst(lc($record->{CITY} || ""));
    my $region  = "";

    # If city is in the U.S., use the state as the "region". 
    # Otherwise, use Geography::Countries to munge the two letter
    # code for the country into its actual name. (Thanks to
    # Aaron Straup Cope for this tip.)
    if ($record->{COUNTRY} eq "US") {
        $region = ucfirst(lc($record->{STATE}));
    } else { $region = country($record->{COUNTRY}); }

    # Now get the weather information.
    my $place   = "$city, $region";
    my $weather = Weather::Underground->new(place => $place);
    my $data    = $weather->getweather(  );
    next unless $data; $data = $data->[0];

    # And print it for our HTML.
    print " <li>$city, $region where it is $data->{conditions}.</li>\n";

    # Record the last domain name
    # for the repeat prevention check
    $lastdomain = $domain;

    # Stop once we have printed the requested number of entries.
    if ($i++ >= $numberoflines-1) { last; }
}

print "</ul>";
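
The repeat check in the script compares each host only with the one seen immediately before it, so two visitors who alternate requests will each be looked up more than once. A minimal sketch of one alternative, using a hash to remember every host already seen, follows; recent_hosts and the sample lines are hypothetical, and only the first field of each line is examined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return up to $limit distinct hosts, most recent first, from a
# list of log lines given in file order (oldest first).
sub recent_hosts {
    my ($limit, @lines) = @_;
    my (%seen, @hosts);

    for my $line (reverse @lines) {
        my ($host) = $line =~ /^(\S+)/ or next;   # skip blank lines
        next if $seen{$host}++;                   # already listed
        push @hosts, $host;
        last if @hosts >= $limit;
    }
    return @hosts;
}

# Hypothetical log excerpt, oldest entry first.
my @sample = (
    "192.0.2.1 - - ...\n",
    "192.0.2.2 - - ...\n",
    "192.0.2.1 - - ...\n",
    "192.0.2.2 - - ...\n",
    "192.0.2.3 - - ...\n",
);

print "$_\n" for recent_hosts(2, @sample);
```

Each distinct host then costs one geographical query and one weather query, which matters when both services can be slow.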

