O'Reilly Hacks
oreilly.comO'Reilly NetworkSafari BookshelfConferences Sign In/My Account | View Cart   
Book List Learning Lab PDFs O'Reilly Gear Newsletters Press Room Jobs  


 
Buy the book!
Spidering Hacks
By Kevin Hemenway, Tara Calishain
October 2003
More Info

HACK
#80
Reformatting Bugtraq Reports
Since Bugtraq is such an important part of a security administrator's watch list, it'll only be a matter of time before you'll want to integrate it more closely with your daily habits
The Code
[Discuss (0) | Link to this hack]

The Code

You'll need the HTML::TableExtract and LWP::Simple modules to grab the Bugtraq page. As we add more features, you'll also need XML::RSS, Net::AIM, and Net::SMTP. You could use other modules like URI::URL or HTML::Element to simplify this hack even further.

There are a couple of things to note about this code. We start by retrieving the arguments passed to the script that will be used to determine the output formats; we'll discuss those later. Next, the data scraped from the Bugtraq page is stuck into a custom data structure to make accessing it easier for later additions to this hack. Also, a subroutine is added to format the data contained in the data structure to ensure minimal code duplication once we have to format for multiple types of output.

Save the following code to a file called bugtraq_hack.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Net::SMTP;
use Net::AIM;
use XML::RSS;

# get params for later use.
my $RUN_STATE = shift(@ARGV);

# the base URL of the site we are scraping and
# the URL of the page where the bugtraq list is located.
my $base_url = "http://www.security-focus.com";
my $url      = "http://www.security-focus.com/archive/1";

# get our data.
my $html_file = get($url) or die "$!\n";

# create an iso date.
my ($day, $month, $year) = (localtime)[3..5];
$year += 1900; my $date = "$year-$month-$day";

# since the data we are interested in is contained in a table,
# and the table has headers, then we can specify the headers and
# use TableExtract to grab all the data below the headers in one
# fell swoop. We want to keep the HTML code intact so that we
# can use the links in our output formats. start the parse:
my $table_extract =
   HTML::TableExtract->new(
     headers   => [qw(Date Subject Author)],
     keep_html => 1 );
$table_extract->parse($html_file);

# parse out the desired info and
# stuff into a data structure.
my @parsed_rows; my $ctr = 0;
foreach my $table ($table_extract->table_states) {
   foreach my $cols ($table->rows) {
      @$cols[0] =~ m|(\d+/\d+/\d+)|;
      my %parsed_cols = ( "date" => $1 );

      # since the subject links are in the 2nd column, parse unwanted HTML
      # and grab the anchor tags. Also, the subject links are relative, so
      # we have to expand them. I could have used URI::URL, HTML::Element,
      # HTML::Parse, etc. to do most of this as well.
      @$cols[1] =~ s/ class="[\w\s]*"//;
      @$cols[1] =~ m|(<a href="(.*)">(.*)</a>)|;
      $parsed_cols{"subject_html"} = "<a href=\"$base_url$2\">$3</a>";
      $parsed_cols{"subject_url"}  = "$base_url$2";
      $parsed_cols{"subject"}      = $3;

      # the author links are in the 3rd
      # col, so do the same thing.
      @$cols[2] =~ s/ class="[\w\s]*"//;
      @$cols[2] =~ m|(<a href="mailto:(.*@.*)">(.*)</a>)|;
      $parsed_cols{"author_html"}  = $1;
      $parsed_cols{"author_email"} = $2;
      $parsed_cols{"author"}       = $3;

      # put all the information into an
      # array of hashes for easy access.
      $parsed_rows[$ctr++] = \%parsed_cols;
   }
}
 
# if no params were passed, then
# simply output to stdout.
unless ($RUN_STATE) { print &format_my_data(  ); }

# formats the actual
# common data, per format.
sub format_my_data(  ) {
   my $data = "";

   foreach my $cols (@parsed_rows)  {
      unless ($RUN_STATE) { $data .= "$cols->{'date'} $cols->{'subject'}\n"; }
   }

   return $data;
}


O'Reilly Home | Privacy Policy

© 2007 O'Reilly Media, Inc.
Website: | Customer Service: | Book issues:

All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.