O'Reilly Hacks
Spidering Hacks
By Morbus Iff, Tara Calishain
October 2003

HACK #68
Checking Blogs for New Comments
Tend to respond directly to weblog posts with a comment or three? Ever wonder about the reactions to your comments? This hack automates the process of keeping up with the conversation you started.

Blogs are the savior of independent publishing, and the ability of most to allow commenting creates an intimate collaboration between performer and audience: read the blog's entry and any existing comments, and then add your own thoughts and opinions. What's most annoying, however, is having to return on a regular basis to see if anyone has added more comments, whether to the original posting or to your own follow-up.

With the RSS syndication format, you can monitor new blog entries in a standard way with any number of popular aggregators. Unfortunately, unless the site in question has provided its comments in RSS format also, there's not really a standard way for comments to be used, repurposed, or monitored.

However, the more you read blogs and their comments, the more you'll see patterns emerge. Perhaps a comment always starts with "On DATE, PERSON said" or "posted by PERSON on DATE," or even plain old "DATE, PERSON." These comment signatures can be the beginning of an answer to your needs: a script that uses regular expressions to check for various types of signatures can adequately tell you when new comments have been posted.

Running the Hack

This script reads a file that lists the URLs you'd like to monitor, one per line. These should be the URLs of the pages that hold the comments on each blog entry, often the same as the entry's permanent link (or permalink). If you're reading http://www.gamegrene.com, for instance, and you've just commented on the article "The Lazy GM," you'll add the following URL to a file named chkcomments.dat:

http://www.gamegrene.com/game_material/the_lazy_gm.shtml

A typical first run considers all comments new—new to you and your script:

% perl chkcomments.pl
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
 * We saw a total of 5 comments (old count: unchecked).
 * Woo! There are new comments to read!

You can also show the name, date, and contact information of each individual comment by passing the --verbose command-line option. This example shows a second run of the script against the same URL:

% perl chkcomments.pl --verbose
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
  - July 23, 2003 01:53 AM: VMB (mailto:vesab@jippii.fi)
  - July 23, 2003 10:55 AM: Iridilate (mailto:)
  - July 29, 2003 02:46 PM: The Bebop Cow (mailto:blackcypress@yahoo.com)
... etc ...
 * We saw a total of 5 comments (old count: 5).

Since no comments were added between our first and second runs, there's nothing new.
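
The old count comes from chkcomments.dat itself: after each run, the script rewrites every line as the URL, the delimiter |%%|, and the count it just saw. A minimal sketch of how such a line is read back, using the same split as the script (the parse_line helper and the sample lines are illustrative assumptions, not part of the hack):

```perl
use strict;
use warnings;

# Parse one line of chkcomments.dat into a URL and a stored count.
# A line without the |%%| delimiter (a URL you just added) has no count yet.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($url, $count) = split(/\|%%\|/, $line);
    return ($url, $count);
}

# Sample lines: one already checked, one freshly added by hand.
my @lines = (
    "http://www.gamegrene.com/game_material/the_lazy_gm.shtml|%%|5\n",
    "http://www.oreillynet.com/pub/wlg/3593\n",
);

foreach my $line (@lines) {
    my ($url, $count) = parse_line($line);
    print "$url => ", (defined $count ? $count : "unchecked"), "\n";
}
```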

But how did the script know how many comments there were in the first place? The answer, as I alluded to previously, is comment signatures. In HTML, every comment on Gamegrene looks like this:

On July 23, 2003 01:53 AM, <a href="mailto:vesab@jippii.fi">VMB</a> said:

In other words, it has a signature of On DATE, <a href="CONTACT">PERSON</a> said or, if you were expressing it as a regular expression, On (.*?), <a href="(.*?)">(.*?)<\/a> said. Keen observers of the script will have noticed this regular expression appear near the top of the code:

my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },

What about the assign line, though? Simply enough, it takes our captured bits of data from the regular expression (the bits that look like (.*?)) and assigns them to more easily understandable variables, like $date, $contact, and $name. The number of times our regular expression matches is the number of comments we've seen on the page. Likewise, the information stored in our variables is the information printed out when we ask for --verbose output.
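
To see the counting in isolation, here is a small sketch that runs the Gamegrene signature over a hard-coded fragment instead of a fetched page. The fragment is sample data assembled from the comments shown earlier, and the @comments array is a name introduced only for this illustration:

```perl
use strict;
use warnings;

# Gamegrene-style signature, as listed in the hack.
my $regex = qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/;

# Sample page fragment with two comments (not a real fetch).
my $html = <<'HTML';
<p>On July 23, 2003 01:53 AM, <a href="mailto:vesab@jippii.fi">VMB</a> said:</p>
<p>On July 23, 2003 10:55 AM, <a href="mailto:">Iridilate</a> said:</p>
HTML

# Each successful match is one comment; the captures become
# the date, contact, and name of that comment.
my @comments;
while ($html =~ /$regex/g) {
    my ($date, $contact, $name) = ($1, $2, $3);
    push @comments, { date => $date, contact => $contact, name => $name };
    print "  - $date: $name ($contact)\n";
}
print " * We saw a total of ", scalar(@comments), " comments.\n";
```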

If you refer back to the code, you'll notice two other signatures that match the comment styles on Dive Into Mark (http://www.diveintomark.org) and the O'Reilly Network (http://www.oreillynet.com) (and possibly other sites that we don't yet know about). Since their signatures already exist, we can add the following URLs to our chkcomments.dat file:

http://diveintomark.org/archives/2003/07/28/atom_news
http://www.oreillynet.com/pub/wlg/3593
http://macdevcenter.com/pub/a/mac/2003/08/01/cocoa_series.html?page=2

and run our script on a regular basis to check for new comments:

% perl chkcomments.pl 
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
 * We saw a total of 5 comments (old count: 5).

Searching http://diveintomark.org/archives/2003/07/28/atom_news...
 * We saw a total of 11 comments (old count: unchecked).
 * Woo! There are new comments to read!

Searching http://www.oreillynet.com/pub/wlg/3593 ...
 * We saw a total of 1 comments (old count: unchecked).
 * Woo! There are new comments to read!

Searching http://macdevcenter.com/pub/a/mac/2003/08/01/cocoa_seri...
 * We saw a total of 9 comments (old count: unchecked).
 * Woo! There are new comments to read!

Hacking the Hack

The obvious way of improving the script is to add new comment signatures that match up with the sites you're reading. Say we want to monitor new comments on Harvard Weblogs (http://blogs.law.harvard.edu/). The first thing we need is a post with comments, so that we can determine the comment signature. Once we find one, view the HTML source to see something like this:

<div class="date"><a href="http://scripting.com">
Dave Winer</a> &#0149; 7/18/03; 7:58:33 AM</div>

The comment signature for Harvard Weblogs is equivalent to <a href="CONTACT">PERSON</a> DATE, which can be stated in regular expression form as date"><a href="(.*?)">(.*?)<\/a> &#0149; (.*?)<\/div>. Once we have the signature in regular expression form, we just need to assign our matches to the variable names and add the signature to our listings at the top:

my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/&middot; (.*?) &middot; .*?<a href="(.*?)">(.*?)<\/a>/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})&nbsp;(.*)/,
     assign => '($date,$name,$contact) = ($1,$2,"none")'
   },
   { regex  => qr/date"><a href="(.*?)">(.*?)<\/a> &#0149; (.*?)<\/div>/,
     assign => '($contact,$name,$date) = ($1,$2,$3)'
   },
);

Now, just add the URL we want to monitor to our chkcomments.dat file, and run the script as usual. Here's the output of our first check, with verbosity turned on:

Searching http://blogs.law.harvard.edu/comments?u=homeManilaWebs...
  - 7/18/03; 1:23:14 AM: James Farmer (http://radio.weblogs.com/0120501/)
  - 7/18/03; 4:06:10 AM: Phil Wolff (http://dijest.com/aka)
  - 7/18/03; 7:58:33 AM: Dave Winer (http://scripting.com)
  - 7/18/03; 6:23:14 PM: Phil Wolff (http://dijest.com/aka)
 * We saw a total of 4 comments (old count: unchecked).
 * Woo! There are new comments to read!
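
Before adding a signature to the script, it can help to try it against a saved snippet of the page. The following is a rough sketch of that check, not part of the hack itself: the $candidate and $sample names are hypothetical, and the sample markup is the Dave Winer snippet joined onto one line, which is an assumption about how the live page is laid out.

```perl
use strict;
use warnings;

# Candidate signature for Harvard Weblogs, in the same
# regex/assign shape the script expects.
my $candidate = {
    regex  => qr/date"><a href="(.*?)">(.*?)<\/a> &#0149; (.*?)<\/div>/,
    assign => '($contact,$name,$date) = ($1,$2,$3)',
};

# Sample markup copied from the page source (joined onto one line).
my $sample = '<div class="date"><a href="http://scripting.com">'
           . 'Dave Winer</a> &#0149; 7/18/03; 7:58:33 AM</div>';

# Run the signature the way the script does: match, then eval
# the assignment string so the captures land in named variables.
my @seen;
while ($sample =~ /$candidate->{regex}/g) {
    my ($date, $contact, $name);
    eval $candidate->{assign};
    die "assign code failed: $@" if $@;
    next unless ($date && $contact && $name);
    push @seen, "$date: $name ($contact)";
}

print "  - $_\n" foreach @seen;
print " * Signature matched ", scalar(@seen), " comment(s).\n";
```

If the count is zero, or a field comes back empty, the regular expression or the order of variables in the assign string likely needs adjusting.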

The Code

Save this script as chkcomments.pl:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use LWP::Simple;
my %opts; GetOptions(\%opts, 'v|verbose');

# where we find URLs. we'll also use this
# file to remember the number of comments.
my $urls_file = "chkcomments.dat";

# what follows is a list of regular expressions and assignment
# code that will be executed in search of matches, per site.
my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/&middot; (.*?) &middot; .*?<a href="(.*?)">(.*?)<\/a>/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})&nbsp;(.*)/,
     assign => '($date,$name,$contact) = ($1,$2,"none")'
   },
);

# open our URL file, and suck it in.
open(URLS_FILE, "<$urls_file") or die $!;
my %urls; while (<URLS_FILE>) { chomp;
   my ($url, $count) = split(/\|%%\|/);
   $urls{$url} = $count || undef;
} close (URLS_FILE);

# foreach URL in our dat file:
foreach my $url (keys %urls) {

   next unless $url; # no URL, no cookie.
   my $old_count = $urls{$url} || undef;

   # print a little happy message.
   print "\nSearching $url...\n"; 

   # suck down the data.
   my $data = get($url) or next;

   # now, begin looping through our matchers.
   # for each regular expression and assignment
   # code, we execute it in this namespace in an
   # attempt to find matches in our loaded data.
   my $new_count; foreach my $code (@signatures) {

      # with our regular expression loaded,
      # let's see if we get any matches.
      while ($data =~ /$code->{regex}/gism) {

         # since our $code contains two Perl statements
         # (one being the regex, above, and the other
         # being the assignment code), we have to eval
         # it once more so the assignments kick in.
         my ($date, $contact, $name); eval $code->{assign};
         next unless ($date && $contact && $name);
         print "  - $date: $name ($contact)\n" if $opts{v};
         $new_count++; # increase the count.
      }

      # if we've gotten a comment count, then assume
      # our regex worked properly, spit out a message,
      # and assign our comment count for later storage.
      if ($new_count) {
         print " * We saw a total of $new_count comments".
               " (old count: ". ($old_count || "unchecked") . ").\n";
         if ($new_count > ($old_count || 0)) { # joy of joys!
             print " * Woo! There are new comments to read!\n"
         } $urls{$url} = $new_count; last; # end the loop.
      }
   }
} print "\n";

# now that our comment counts are updated,
# write it back out to our datafile.
open(URLS_FILE, ">$urls_file") or die $!;
foreach my $url (keys %urls) {
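   # a URL with no count yet (fetch failed, or no signature
   # matched) is written with an empty count, not undef.
   print URLS_FILE "$url|%%|", ($urls{$url} || ""), "\n";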
} close (URLS_FILE);
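
The script stores its assignment code as strings and evals them on every match. One possible variation, sketched here under stated assumptions rather than taken from the hack, is to store a code reference per signature instead. The extract key and the count_comments function are hypothetical names, and only two of the signatures are shown:

```perl
use strict;
use warnings;

# Same idea as @signatures in the script, but each entry carries
# a code reference (hypothetical "extract" key) that maps the
# captures to (date, contact, name) instead of a string to eval.
my @signatures = (
    { regex   => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
      extract => sub { return ($_[0], $_[1], $_[2]) },
    },
    { regex   => qr/(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})&nbsp;(.*)/,
      extract => sub { return ($_[0], "none", $_[1]) },
    },
);

# Count comments in $data using the first signature that matches,
# mirroring the script's "first signature wins" behavior.
sub count_comments {
    my ($data) = @_;
    foreach my $sig (@signatures) {
        my $count = 0;
        while ($data =~ /$sig->{regex}/g) {
            my ($date, $contact, $name) = $sig->{extract}->($1, $2, $3);
            next unless ($date && $contact && $name);
            $count++;
        }
        return $count if $count;
    }
    return 0;
}

# Sample data, not a real fetch.
my $data = 'On July 23, 2003 01:53 AM, '
         . '<a href="mailto:vesab@jippii.fi">VMB</a> said:';
print "Found ", count_comments($data), " comment(s).\n";
```

The trade-off is a matter of taste: code references are checked when the script compiles, while the string form keeps each signature to two short lines.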



© 2007 O'Reilly Media, Inc.