Spidering Hacks
By Morbus Iff, Tara Calishain
October 2003

HACK #79: Word Associations with Lexical Freenet
There will come a time when you want a little more than simple word definitions, synonyms, or etymologies. Lexical Freenet takes you beyond these simple results, providing associative data, or "paths," from your word to others.

Lexical Freenet (http://www.lexfn.com) allows you to search for word relationships like puns, rhymes, concepts, relevant people, antonyms, and so much more. For example, a simple search for the word disease returns a long listing of word paths, each associated with other words by different types of connecting arrows: disease triggers both aids and cancer, comprises symptoms, and bio triggers such relevant persons as janet elaine adkins, james parkinson, alois alzheimer, and so on. This is but a small sampling of the available and verbose output.

In combination with Super Word Lookup, a command-line interface to Lexical Freenet's functionality would bring immense lookup capabilities to writers, librarians, and researchers. This hack shows you how to create that interface, with the ability to customize which relationships you'd like to see, as well as turn the visual connections into text.

The Code

Save the following code as lexfn.pl:

#!/usr/bin/perl -w
#
# Hack to query and report from www.lexfn.com
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
# by rik - ora@rikrose.net
#

######################
# support stage      #
######################

use strict;
use Getopt::Std qw(getopts);
use LWP::Simple qw(get);
use URI::Escape qw(uri_escape uri_unescape);
use HTML::TokeParser;

sub usage () { print "
usage: lexfn [options] word1 [word2]
options available:
 -s Synonymous     -a Antonym        -b Birth Year
 -t Triggers       -r Rhymes         -d Death Year
 -g Generalizes    -l Sounds like    -T Bio Triggers
 -S Specializes    -A Anagram of     -k Also Known As
 -c Comprises      -o Occupation of
 -p Part of        -n Nationality

 or -x for all

word1 is mandatory, but some searches require word2\n\n"
}

######################
# parse stage        #
######################

# grab arguments, and put them into %args hash, leaving nonarguments
# in @ARGV for us to process later (where word1 and word2 would be)
# if we don't have at least one argument, we die with our usage.
my %args; getopts('stgScparlAonbdTkx', \%args);
if (@ARGV > 2 || @ARGV == 0) { usage(); exit 0; }

# turn both our words into queries.
$ARGV[0] =~ s/ /\+/g; $ARGV[1] ||= "";
if ($ARGV[1]) { $ARGV[1] =~ s/ /\+/g; }

# begin our URL construction with the keywords.
my $URL = "http://www.lexfn.com/l/lexfn-cuff.cgi?sWord=$ARGV[0]".
          "&tWord=$ARGV[1]&query=show&maxReach=2";

# now, let's figure out our command-line arguments. each
# argument is associated with a relevant search at LexFN,
# so we'll first create a mapping to and fro.
my %keynames = (
 s => 'ASYN', t => 'ATRG', g => 'AGEN', S => 'ASPC', c => 'ACOM', 
 p => 'APAR', a => 'AANT', r => 'ARHY', l => 'ASIM', A => 'AANA', 
 o => 'ABOX', n => 'ABNX', b => 'ABBX', d => 'ABDX', T => 'ABTR', 
 k => 'ABAK'
);

# if the user asked for everything (-x), turn on every
# relationship flag in our arguments hash,
# in preparation for our URL.
if (defined($args{'x'}) && $args{'x'} == 1) {
   foreach my $arg (qw/s t g S c p a r l A o n b d T k/){
       $args{$arg} = 1; # in preparation for URL.
   } delete $args{'x'}; # x means nothing to LexFN.
}

# build the URL from the flags we want.
foreach my $arg (keys %args) { $URL .= '&' . $keynames{$arg} . '=on'; }

######################
# request stage      #
######################

# and download it all for parsing.
my $content = get($URL) or die "Couldn't download $URL";

######################
# extract stage      #
######################

# with the data sucked down, pass it off to the parser.
my $stream = HTML::TokeParser->new( \$content ) or die $!;

# skip past the form on the page; the data we want starts
# at the first <b> tag that follows it.
my $tag = $stream->get_tag("/form");
while ($tag = $stream->get_tag("b")) {
    print $stream->get_trimmed_text("/b") . " ";
    $tag = $stream->get_tag("img");
    print $tag->[1]{alt} . " ";
    $tag = $stream->get_tag("a");
    print $stream->get_trimmed_text("/a") . "\n";
}

exit 0;
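
Once saved, invoke the script with one or more of the relationship flags from the usage message, plus a word (some searches need a second word). Each line of output is a source word, the relationship read from the connecting arrow's alt text, and the related word. A run for the disease example above might look something like this (the output shape follows the code; actual results depend on LexFN's live data):

% perl lexfn.pl -t disease
disease triggers aids
disease triggers cancer
...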

The code is split into four basic stages:

Support code
    Includes and any subroutines you will need.

The parsing stage
    Where we work out what the user actually wants and build a URL to perform the request.

The request stage
    Where we retrieve the results.

The extract stage
    Where we recover the data.

In this case, the Lexical Freenet site is basic enough that the request is a single URL. A typical Lexical Freenet URL looks something like this:

http://www.lexfn.com/l/lexfn-cuff.cgi?fromresub=on&
ASYN=on&ATRG=on&AGEN=on&ASPC=on&ACOM=on&APAR=on&AANT=on&
ARHY=on&ASIM=on&AANA=on&ABOX=on&ABNX=on&ABBX=on&ABDX=on&
ABTR=on&ABAK=on&sWord=lee+harvey+oswald&tWord=disobey&query=SHOW
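
As a side note, the same query string could be assembled with the URI module rather than by the manual concatenation the hack uses; here's a minimal sketch using the parameter names from the URL above:

#!/usr/bin/perl -w
use strict;
use URI;

# build the same LexFN query with URI's query_form, which
# form-encodes the values for us (spaces become plus signs).
my $uri = URI->new('http://www.lexfn.com/l/lexfn-cuff.cgi');
$uri->query_form(
    sWord => 'lee harvey oswald',
    tWord => 'disobey',
    query => 'SHOW',
    ASYN  => 'on',  # synonyms
    ATRG  => 'on',  # triggers
);
print $uri->as_string, "\n";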

The data we wish to extract comes from a standard, repetitive chunk of HTML that recurs throughout the search results. This allows us to use the simple HTML::TokeParser module to retrieve chunks of data easily: we step through the HTML tags, query their attributes, and pull out the surrounding text. As you can tell from the previous code, this is not too difficult.
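
To see that parsing pattern in isolation, here is a self-contained sketch run against a made-up fragment shaped like LexFN's results (the HTML below is illustrative, not a capture of the real page):

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# a fabricated fragment in the same <b>word</b>, arrow <img>,
# <a>word</a> shape the hack parses from LexFN's results.
my $html = <<'HTML';
<form action="lexfn-cuff.cgi"></form>
<b>disease</b> <img src="arrow.gif" alt="triggers"> <a href="#">cancer</a>
<b>disease</b> <img src="arrow.gif" alt="comprises"> <a href="#">symptoms</a>
HTML

my $stream = HTML::TokeParser->new(\$html) or die "Couldn't parse";
$stream->get_tag("/form");                       # skip past the form
while (my $tag = $stream->get_tag("b")) {
    print $stream->get_trimmed_text("/b"), " ";  # source word
    $tag = $stream->get_tag("img");
    print $tag->[1]{alt}, " ";                   # relationship, from alt text
    $tag = $stream->get_tag("a");
    print $stream->get_trimmed_text("/a"), "\n"; # related word
}

Each pass through the loop consumes one <b>/<img>/<a> triple, which is exactly the unit LexFN uses to draw a single arrow in its results.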

