
NoXML, Another SOAP::Lite Alternative
NoXML is a regular expressions-based, XML
Parser-free drop-in alternative to SOAP::Lite.
[03/13/03]
XML jockeys might well
want to avert their eyes for this one. What is suggested here is
so preposterous that it just might prove useful, and indeed it
does. NoXML is a drop-in alternative to
SOAP::Lite. As its name suggests, this home-brewed module
doesn't make use of an XML parser of any kind,
relying instead on some dead-simple regular expressions and other
bits of programmatic magic.
If you have only a basic Perl installation at your disposal and are
lacking both the SOAP::Lite [Hack #52] and XML::Parser Perl modules, NoXML will do in
a pinch, playing nicely with just about every Perl hack in this book.
TIP
As any XML guru will attest, there's simply no
substitute for an honest-to-goodness XML parser. And
they'd be right. There are encoding and hierarchy
issues that a regular expression-based parser simply
can't fathom. NoXML is simplistic at best. That
said, it does what needs doing, the very essence of
"hacking."
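To make the hierarchy and encoding caveats concrete, here is a minimal sketch. The XML fragment and the three-entry entity table are invented for illustration and are not taken from NoXML.pm or from a real Google response.

```perl
use strict;
use warnings;

# Hypothetical fragment: an <item> nested inside another <item>.
my $xml = '<item><name>outer</name><item><name>inner</name></item></item>';

# A non-greedy match stops at the FIRST closing tag it meets, so the
# capture for the outer item is cut short inside the inner one.
my ($captured) = $xml =~ m#<item>(.+?)</item>#s;
print "captured: $captured\n";

# A fixed entity table handles only the entities it lists; a numeric
# character reference such as &#233; passes through untouched.
my %unescape = ('&lt;' => '<', '&gt;' => '>', '&amp;' => '&');
my $text = 'caf&#233; &amp; bar';
(my $decoded = $text) =~ s/(&lt;|&gt;|&amp;)/$unescape{$1}/g;
print "decoded: $decoded\n";
```

A real XML parser tracks nesting depth and resolves character references, which is exactly the work a regular expression skips.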
Best of all, NoXML can fill in for SOAP::Lite with little more than a
two-line alteration to the target hack.
Running the Hack
Run the script from the command line, passing your query as the
first argument and redirecting the output to the CSV file you wish to
create. A single > overwrites an existing file; use >> instead to
append additional results. For example, using "no xml" as the query and
results.csv as the output file:
$ perl noxml_google2csv.pl "no xml" > results.csv
Leaving off the > and CSV filename sends the
results to the screen for your perusal.
The Results
% perl noxml_google2csv.pl "no xml"
"title","url","snippet"
"site-comments@w3.org from January 2002: No XML specifications",
"http://lists.w3.org/Archives/Public/site-comments/2002Jan/0015.html",
"No XML specifications. From: Prof. ... Next message: Ian B. Jacobs:
"Re: No XML specifications"; Previous message: Rob Cummings:
"Website design..."; ... "
...
"Re: [xml] XPath with no XML Doc",
"http://mail.gnome.org/archives/xml/2002-March/msg00194.html",
" ... Re: [xml] XPath with no XML Doc. From: "Richard Jinks"
<cyberthymia yahoo co uk>; To: <xml gnome org>; Subject:
Re: [xml] XPath with no XML Doc; ... "
Applicability and Limitations
In the same manner, you can adapt just about any SOAP::Lite-based
hack in this book and those you've made up yourself
to use NoXML.
- Place NoXML.pm in the same directory as the hack at hand.
- Replace use SOAP::Lite; with use NoXML;.
- Replace my $google_search = SOAP::Lite->service("file:$google_wdsl"); with my $google_search = new NoXML;.
There are, however, some limitations. While NoXML works nicely to
extract results and aggregate results the likes of
<estimatedTotalResultsCount />, it falls
down on gleaning some of the more advanced result elements like
<directoryCategories />, an array of
categories turned up by the query.
In general, bear in mind that your mileage may vary and
don't be afraid to tweak.
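As one example of such a tweak, the sketch below pulls category names out of a directoryCategories block using the same regex-only approach. The helper name extract_directory_categories is hypothetical and not part of NoXML.pm. The sample markup is a simplified assumption about the response layout, with namespace prefixes already stripped.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of NoXML.pm: return the category names
# found inside a <directoryCategories> block.
sub extract_directory_categories {
    my ($response) = @_;

    # Isolate the block; return an empty list when it is absent or empty.
    my ($block) = $response =~ m#<directoryCategories[^>]*>(.*?)</directoryCategories>#s
        or return ();

    # Collect every <fullViewableName> value inside the block.
    return $block =~ m#<fullViewableName[^>]*>([^<]*)</fullViewableName>#g;
}

# Simplified, assumed sample of a response fragment.
my $sample = <<'XML';
<directoryCategories type="Array">
<item><fullViewableName>Top/Computers/Programming</fullViewableName></item>
<item><fullViewableName>Top/Reference/Libraries</fullViewableName></item>
</directoryCategories>
XML

my @categories = extract_directory_categories($sample);
print "$_\n" for @categories;
```

Call the helper in list context; in scalar context the final match returns only true or false.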
The Code
The heart of this hack is NoXML.pm, which should
be saved into the same directory as your hacks themselves.
# NoXML.pm
# NoXML [pronounced "no xml"] is a dire-need drop-in
# replacement for SOAP::Lite designed for Google Web API hacking.
package NoXML;
use strict;
no strict "refs";

# LWP for making HTTP requests; no XML parser is needed
use LWP::UserAgent;

# Create a new NoXML
sub new {
    my $self = {};
    bless($self);
    return $self;
}

# Replacement for the SOAP::Lite-based doGoogleSearch method
sub doGoogleSearch {
    my($self, %args);
    ($self, @args{qw/ key q start maxResults filter restrict
        safeSearch lr ie oe /}) = @_;

    # grab SOAP request from __DATA__
    my $tell = tell(DATA);
    my $soap_request = join '', <DATA>;
    seek(DATA, $tell, 0);
    $soap_request =~ s/\$(\w+)/$args{$1}/ge; # interpolate variables

    # Make (POST) a SOAP-based request to Google
    my $ua = LWP::UserAgent->new;
    my $req = HTTP::Request->new(POST => 'http://api.google.com/search/beta2');
    $req->content_type('text/xml');
    $req->content($soap_request);
    my $res = $ua->request($req);
    my $soap_response = $res->as_string;

    # Drop the HTTP headers and so forth until the initial xml element
    $soap_response =~ s/^.+?(<\?xml)/$1/migs;

    # Drop element namespaces for tolerance of future prefix changes
    $soap_response =~ s!(<\/?)[\w-]+?:([\w-]+?)!$1$2!g;

    # Set up a return dataset
    my $return;

    # Unescape escaped HTML in the resultset
    my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&',
                    '&quot;'=>'"', '&apos;'=>"'");
    my $unescape_re = join '|' => keys %unescape;

    # Divide the SOAP response into the results and other metadata.
    # The tag patterns below assume the <resultElements>/<item> layout
    # of Google's SOAP response.
    my($before, $results, $after) = $soap_response =~
        m#(^.+)<resultElements[^>]*>(.+?)</resultElements>(.+$)#migs;
    my $before_and_after = $before . $after;

    # Glean as much metadata as possible (while being somewhat lazy ;-)
    while ($before_and_after =~ m#<([\w-]+)(\s[^>]*)?>([^<]*?)<#migs) {
        $return->{$1} = $3; # pack the metadata into the return dataset
    }

    # Glean the results
    my @results;
    while ($results =~ m#<item[^>]*>(.+?)</item>#migs) {
        my $item = $1;
        my $pairs = {};
        while ( $item =~ m#<([\w-]+)[^>]*>([^<]*)#migs ) {
            my($element, $value) = ($1, $2);
            $value =~ s/($unescape_re)/$unescape{$1}/g;
            $pairs->{$element} = $value;
        }
        push @results, $pairs;
    }

    # Pack the results into the return dataset
    $return->{resultElements} = \@results;

    # Return nice, clean, usable results
    return $return;
}

1;
# This is the SOAP message template sent to api.google.com. Variables
# signified with $variablename are replaced by the values of their
# counterparts sent to the doGoogleSearch subroutine.
__DATA__
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/1999/XMLSchema">
<SOAP-ENV:Body>
<ns1:doGoogleSearch xmlns:ns1="urn:GoogleSearch"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<key xsi:type="xsd:string">$key</key>
<q xsi:type="xsd:string">$q</q>
<start xsi:type="xsd:int">$start</start>
<maxResults xsi:type="xsd:int">$maxResults</maxResults>
<filter xsi:type="xsd:boolean">$filter</filter>
<restrict xsi:type="xsd:string">$restrict</restrict>
<safeSearch xsi:type="xsd:boolean">$safeSearch</safeSearch>
<lr xsi:type="xsd:string">$lr</lr>
<ie xsi:type="xsd:string">$ie</ie>
<oe xsi:type="xsd:string">$oe</oe>
</ns1:doGoogleSearch>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
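The template values are substituted into the message verbatim, so a query containing & or < would leave the request malformed. The sketch below shows one way to escape values during that substitution. The xml_escape helper and the cut-down template are hypothetical and are not part of NoXML.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters XML reserves in text.
sub xml_escape {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;   # ampersand first, so it is not escaped twice
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# A cut-down stand-in for the SOAP template (single-quoted, so Perl
# leaves the $q and $start placeholders alone).
my $template = '<q xsi:type="xsd:string">$q</q>'
             . '<start xsi:type="xsd:int">$start</start>';
my %args = (q => 'cats & dogs <pets>', start => 0);

# The same placeholder substitution as in the module, with escaping added.
(my $request = $template) =~ s/\$(\w+)/xml_escape($args{$1})/ge;
print "$request\n";
```

In NoXML.pm itself, the equivalent change would be to wrap $args{$1} in such a helper on the line that interpolates the variables.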
Here's a little script to show NoXML in action.
It's no different, really, from any number of hacks
in this book. The only alterations necessary to make use of
NoXML instead of SOAP::Lite appear alongside the commented-out
SOAP::Lite lines they replace.
#!/usr/bin/perl
# noxml_google2csv.pl
# Google Web Search Results via NoXML ("no xml") module
# exported to CSV suitable for import into Excel
# Usage: noxml_google2csv.pl "{query}" [> results.csv]

# Your Google API developer's key
my $google_key='insert key here';

use strict;

# use SOAP::Lite;
use NoXML;

$ARGV[0]
    or die qq{usage: perl noxml_google2csv.pl "{query}"\n};

# my $google_search = SOAP::Lite->service("file:$google_wdsl");
my $google_search = new NoXML;

my $results = $google_search->doGoogleSearch(
    $google_key, shift @ARGV, 0, 10, "false",
    "", "false", "", "latin1", "latin1"
);

@{$results->{'resultElements'}} or die('No results');

print qq{"title","url","snippet"\n};

foreach (@{$results->{'resultElements'}}) {
    $_->{title} =~ s!"!""!g; # double escape " marks
    $_->{snippet} =~ s!"!""!g;
    my $output = qq{"$_->{title}","$_->{URL}","$_->{snippet}"\n};
    $output =~ s!<.+?>!!g; # drop all HTML tags
    print $output;
}
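The CSV step in the loop above can also be pulled out into small helpers. This is a sketch only: csv_field and csv_row are hypothetical names, not part of the hack, and the sample values are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: quote one CSV field, doubling embedded quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/<[^>]+>//g;   # drop HTML tags such as <b>
    $value =~ s/"/""/g;       # double any embedded " marks
    return qq{"$value"};
}

# Hypothetical helper: join quoted fields into one CSV line.
sub csv_row {
    return join(',', map { csv_field($_) } @_) . "\n";
}

my $row = csv_row('Say "<b>no</b> xml"', 'http://example.com/', 'a, b');
print $row;
```

Unlike the script, this version strips tags from each field before doubling the quotes, and it quotes the URL field with the same routine as the others.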
See also:
PoXML [#53], a plain old XML alternative to SOAP::Lite
XooMLE [#36], a third-party service offering an intermediary plain old XML interface to the Google Web API
© 2007 O'Reilly Media, Inc.