
Aggregating RSS and Posting Changes
With the proliferation of individual and group
weblogs, it's typical for one person to post in
multiple places. Thanks to RSS syndication, you can easily aggregate
all your disparate posts into one weblog.
You might have heard of RSS. It's an XML format
that's commonly used to syndicate headlines and
content between sites. It's also used in specialty
software programs called headline aggregators or
readers. Many popular
weblog software packages, including
Movable Type
(http://www.movabletype.org) and
Blogger (http://www.blogger.com), offer RSS feeds. So
too do some of the
content management
systems—Slashcode
(http://slashcode.com),
PHPNuke (http://phpnuke.org),
Zope (http://www.zope.org), and the like—that
run some of the more popular tech news sites.
If you produce content for various people, you might find your
writing and commentary scattered all over the place. Or, say you have
a group of friends and all of you want to aggregate your postings
into a single place without abandoning your individual efforts. This
hack is a personal spider just for you; it aggregates entries from
multiple RSS feeds and posts those
new entries to a Movable Type blog.
Running the Hack
To run the code, you'll need a
Movable Type
weblog. At the very least, you need the username, password, XML-RPC
URL for Movable Type, and the blog ID (normally 1
if you have only one). Here's an example of
connecting to Kevin's Movable Type installation to
show a list of categories to post to (the
--showcategories switch is, strangely enough,
showing the categories):
% perl myrssmerger.pl -s http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
-u morbus -p HAAHAHAH -b 1 --showcategories
The output looks like this:
----------------------------------------------------------------------
The following blog categories are available:
1: Disobey Stuff
2: The Idiot Box
3: CHIApet
4: Friends O' Disobey
5: Stalkers O' Morbus
6: Morbus Shoots, Jesus Saves
7: El Casho Disappearo
8: TechnOccult
9: Potpourri
10: Collected Nonsensicals
Category ID's can be used for --catid or -c.
----------------------------------------------------------------------
If you have no categories, the script tells you so.
When you're actually posting to the blog, you can
choose to post into a category or not; if you want to post into
Disobey Stuff, use either -c 1 or --catid
1 when you run the program. If you want no category,
leave the switch off.
Let's take a look at a few examples of how to use
the script. Say Kevin wants to aggregate all the data from all the
places he publishes information. Every night he'll
use cron to run the script for various
RSS feeds. Here's an example:
% perl myrssmerger.pl --server [RETURN]
http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 1 [RETURN]
http://gamegrene.com/index.xml
In this case, he's saying, "Every
night, check the Gamegrene RSS files for entries posted today. If you
see any, post them to Disobey Stuff" (which is the
first category, referenced with the --catid 1
switch). He can then run the script again, only for a different RSS
feed with a different category switch, and so on.
Let's take a look at the output of the Gamegrene
example:
----------------------------------------------------------------------
Downloading RSS feed at http://gamegrene.com/index.xml...
Publishing item: 'RPG, For Me'.
Skipping (failed date check): 'Just Say No To Powergamers'.
Skipping (failed date check): 'Every Story Needs A Soundtrack'.
Skipping (failed date check): 'The Demise of Local Game Shops'.
Skipping (failed date check): 'Death Of A Gaming System'.
Skipping (failed date check): 'What Do You Do With Six Million Elves?'.
----------------------------------------------------------------------
As you can see, the script checks the dates in the RSS feed to make
sure they're new before the items are added to the
Movable Type weblog. Dates are read from each item's
<dc:date> element in the remote RSS feed; if
the feed doesn't supply them, no item can pass the date check and
nothing from that feed will be posted.
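The date check comes down to a string-prefix comparison. Here is a minimal sketch of that idea, separate from the script itself; the helper names today_string and is_from_today are made up for illustration, and the sample values assume dc:date strings that begin with YYYY-MM-DD (for example, 2003-05-17T09:30:00-05:00).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a YYYY-MM-DD string from a localtime-style list.
# (today_string is a hypothetical helper, not part of myrssmerger.pl.)
sub today_string {
    my ($day, $month, $year) = @_[3, 4, 5];
    return sprintf("%04d-%02d-%02d", $year + 1900, $month + 1, $day);
}

# True only if the item carries a date that starts with today's date.
sub is_from_today {
    my ($dc_date, $today) = @_;
    return 0 unless defined $dc_date && length $dc_date;
    return index($dc_date, $today) == 0 ? 1 : 0;
}

my $today = today_string(localtime);

# Made-up sample dates: one from today, one old, one missing.
for my $date ("${today}T09:30:00-05:00", "1999-12-31T23:59:59+00:00", undef) {
    my $shown   = defined $date ? $date : "(no date)";
    my $verdict = is_from_today($date, $today) ? "publish" : "skip";
    print "$verdict: $shown\n";
}
```

Because only the leading date is compared, the time of day and the time zone offset in the feed are ignored; an item stamped late at night in another time zone may therefore be counted under a different day than you expect.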
What happens when you want to check many RSS feeds but you want to
add them all to the same category? You can do that in a single run of the
script by listing every feed URL on the command line. Say you want to
check three different RSS feeds, not
necessarily all yours. Here's an example of Kevin
checking three feeds (including Tara's) and posting
any new items to category 4 (Friends O' Disobey):
% perl myrssmerger.pl --server [RETURN]
http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 4 [RETURN]
http://gamegrene.com/index.xml http://researchbuzz.com/researchbuzz.rss [RETURN]
http://camworld.com/index.rdf
The shortened output looks like this:
----------------------------------------------------------------------
Downloading RSS feed at http://gamegrene.com/index.xml...
Skipping (failed date check): 'RPG, For Me'.
Skipping (failed date check): 'Just Say No To Powergamers'.
Skipping (failed date check): 'Every Story Needs A Soundtrack'.
----------------------------------------------------------------------
Downloading RSS feed at http://camworld.com/index.rdf...
Publishing item: 'Trinity's Hack from Matrix Reloaded'.
Skipping (failed date check): 'Siberian Desktop'.
Skipping (failed date check): 'The Sweet Hereafter'.
----------------------------------------------------------------------
Downloading RSS feed at http://researchbuzz.com/researchbuzz.rss...
Skipping (no description/date): 'Northern Light Coming Back?'.
Skipping (no description/date): 'This Week in LLRX'.
----------------------------------------------------------------------
Note that Tara's feed can't be used by this script;
that's because she's generating her
RSS by hand and her feed doesn't have dates. Most
program-generated feeds, like those of Movable Type, have dates and
descriptions and will be just fine.
As you can see, we can choose a variety of feeds to use and we can
post them to any of our Movable Type categories. Is there anything
else this script can do? Well, actually, yes; it can filter incoming
entries, posting only those that match a specified keyword. To do that, use the
--filter switch. As an example, this command posts
only those entries whose descriptions include the string
"perl" (the match is case-insensitive):
% perl myrssmerger.pl --server [RETURN]
http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 4 --filter "perl" [RETURN]
http://camworld.com/index.rdf
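The filter value is dropped into a case-insensitive regular expression, so regex metacharacters in it are live. The sketch below illustrates that behavior on made-up items; passes_filter mirrors the test the script applies, while passes_literal_filter is a hypothetical variant (not in the script) that quotes the value so it is matched as plain text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample items standing in for parsed RSS entries.
my @items = (
    { title => "Regex tricks", description => "Five Perl idioms worth knowing." },
    { title => "Garden notes", description => "Tomatoes are doing fine." },
    { title => "C++ vs. Perl", description => "Comparing C++ templates and closures." },
);

# Same kind of test the script applies: a case-insensitive regex match
# against the description. With no filter, everything passes.
sub passes_filter {
    my ($item, $filter) = @_;
    return 1 unless defined $filter && length $filter;
    return $item->{description} =~ /$filter/i ? 1 : 0;
}

# Hypothetical variant: \Q...\E makes the filter match as literal text.
sub passes_literal_filter {
    my ($item, $filter) = @_;
    return 1 unless defined $filter && length $filter;
    return $item->{description} =~ /\Q$filter\E/i ? 1 : 0;
}

for my $item (@items) {
    my $verdict = passes_filter($item, "perl") ? "post" : "skip";
    print "$verdict: $item->{title}\n";
}
```

Note that only the description is tested; the third sample item mentions Perl in its title alone, so it is skipped.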
Hacking the Hack
Actually, this is both a "hacking the
hack" and "some things to
consider" section. Right now, the biggest downside
is that this hack works only on Movable Type. You could dive into
Net::Blogger a bit and make it usable by
Blogger (http://www.blogger.com),
Radio
Userland (http://radio.userland.com/), or any one of
the other weblogging platforms.
This script is designed to run once a day: it downloads each RSS feed
in full on every run and posts anything dated today. As it stands, you
should run it no more than once a day, for two reasons:
- Downloading full RSS files more often wastes bandwidth, both yours
and the feed publisher's.
- The script keeps no record of what it has already posted, so every
extra run on the same day posts the same items again.
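One way around the repeat-posting problem is to remember which item links have already been posted. What follows is only a sketch of that idea, not part of the original script: load_seen and remember are hypothetical helpers, the links are made-up examples, and a temporary file stands in for whatever path you would really use.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Load previously posted links (one per line) into a hash for fast lookup.
sub load_seen {
    my ($file) = @_;
    my %seen;
    return \%seen unless -e $file;
    open my $fh, '<', $file or die "[err] Can't read $file: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        $seen{$line} = 1 if length $line;
    }
    close $fh;
    return \%seen;
}

# Append a link to the file and mark it as seen in memory.
sub remember {
    my ($file, $seen, $link) = @_;
    open my $fh, '>>', $file or die "[err] Can't write $file: $!\n";
    print {$fh} "$link\n";
    close $fh or die "[err] Can't close $file: $!\n";
    $seen->{$link} = 1;
}

# A temporary file keeps this demo self-contained; a real run would
# use a fixed path so the list survives between runs.
my (undef, $seen_file) = tempfile(UNLINK => 1);
my $seen = load_seen($seen_file);

# Made-up item links; the first one shows up twice.
for my $link ("http://example.com/a", "http://example.com/b", "http://example.com/a") {
    if ($seen->{$link}) {
        print "Skipping (already posted): $link\n";
        next;
    }
    print "Publishing: $link\n";
    remember($seen_file, $seen, $link);
}
```

In the script itself, the natural place for such a check would be just before the newPost call, keyed on the item's link.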
All right, let's talk about a couple of actual
hacks. First is error checking; as is, the script
doesn't check the URLs to make sure they start with
http://. That's easily solved;
just add the "not an HTTP URL" check at the top of the loop:
# loop through each RSS URL.
foreach my $rss_url (@ARGV) {
    # not an HTTP URL.
    next unless $rss_url =~ m!^http://!;
    # download whatever we've got coming.
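As a variation on that check, the sketch below reports what it skips instead of passing over it silently. It is an illustration only: looks_like_feed_url is a hypothetical helper, the candidate list is made up, and accepting https as well as http is an assumption that goes beyond the check described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True for arguments that look like HTTP or HTTPS URLs.
# (Accepting https is an assumption; the hack itself checks http only.)
sub looks_like_feed_url {
    my ($url) = @_;
    return 0 unless defined $url;
    return $url =~ m!^https?://!i ? 1 : 0;
}

# Made-up command-line arguments, two of them unusable.
my @candidates = (
    "http://gamegrene.com/index.xml",
    "ftp://example.com/feed.xml",
    "index.rdf",
);

for my $url (@candidates) {
    unless (looks_like_feed_url($url)) {
        print "Skipping (not an HTTP URL): $url\n";
        next;
    }
    print "Would download: $url\n";
}
```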
Next, the preface and the anteface (i.e., the text that surrounds the
posted entry) are hardcoded into the script, but we can change that
via a switch on the command line. First make the preface and anteface
command-line options:
GetOptions(\%opts, 'server|s=s',      # the XML-RPC URL to use.
                   'username|u=s',    # the Movable Type username to use.
                   'password|p=s',    # the Movable Type password to use.
                   'blogid|b=i',      # unique ID of your blog.
                   'catid|c=i',       # unique ID for posting category.
                   'showcategories',  # list categories for blog.
                   'filter|f=s',      # per item filter for posting?
                   'preface|r=s',     # the preface text before a posted item
                   'anteface|a=s',    # the text included after a posted item
);
You'll then need to make a change to the preface
line:
my $preface = $opts{preface} || "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";
and a similar change to the anteface line:
my $anteface = $opts{anteface}
|| "</blockquote>\n\n"; # new items as quotes.
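To see how the two new options interact with the defaults, here is a small self-contained sketch. The wrapper_text function and the sample arguments are hypothetical; it uses GetOptionsFromArray from Getopt::Long only so the example can parse a made-up argument list rather than the real command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Work out the text wrapped around each posted item: command-line values
# win, otherwise fall back to the defaults shown above.
# (wrapper_text is a hypothetical helper for illustration.)
sub wrapper_text {
    my ($args, $clink, $ctitle) = @_;
    my %opts;
    GetOptionsFromArray($args, \%opts, 'preface|r=s', 'anteface|a=s')
        or die "[err] Bad command-line options.\n";
    my $preface  = $opts{preface}
        || "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";
    my $anteface = $opts{anteface} || "</blockquote>\n\n";
    return ($preface, $anteface);
}

# Made-up arguments: override the preface, keep the default anteface.
my @args = ('--preface', '<p>Seen elsewhere:</p><blockquote>');
my ($preface, $anteface)
    = wrapper_text(\@args, "http://gamegrene.com/", "Gamegrene");
print $preface, "An example item description.", $anteface;
```

One thing to keep in mind: a preface given on the command line replaces the default entirely, so the link back to the source feed and its title disappear unless you include them yourself.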
The Code
You'll need LWP::Simple,
Net::Blogger, and XML::RSS
to use this. Save the following code to a file named
myrssmerger.pl:
#!/usr/bin/perl -w
#
# MyRSSMerger - read multiple RSS feeds, post new entries to Movable Type.
# http://disobey.com/d/code/ or contact morbus@disobey.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
use strict; $|++;
my $VERSION = "1.0";
use Getopt::Long;
my %opts;
# make sure we have the modules we need, else die peacefully.
eval("use LWP::Simple;"); die "[err] LWP::Simple not installed.\n" if $@;
eval("use Net::Blogger;"); die "[err] Net::Blogger not installed.\n" if $@;
eval("use XML::RSS;"); die "[err] XML::RSS not installed.\n" if $@;
# define our command line flags (long and short versions).
GetOptions(\%opts, 'server|s=s',      # the XML-RPC URL to use.
                   'username|u=s',    # the Movable Type username to use.
                   'password|p=s',    # the Movable Type password to use.
                   'blogid|b=i',      # unique ID of your blog.
                   'catid|c=i',       # unique ID for posting category.
                   'showcategories',  # list categories for blog.
                   'filter|f=s',      # per item filter for posting?
);
# at the very least, we need our login information.
die "[err] XML-RPC URL missing, use --server or -s.\n" unless $opts{server};
die "[err] Username missing, use --username or -u.\n"
unless $opts{username};
die "[err] Password missing, use --password or -p.\n"
unless $opts{password};
die "[err] BlogID missing, use --blogid or -b.\n" unless $opts{blogid};
# every request past this point requires
# a connection, so we'll go and do so.
print "-" x 76, "\n"; # visual separator.
my $mt = Net::Blogger->new(engine=>"movabletype");
$mt->Proxy($opts{server}); # the servername.
$mt->Username($opts{username}); # the username.
$mt->Password($opts{password}); # the... ok. self-
$mt->BlogId($opts{blogid}); # explanatory!
# show existing categories.
if ($opts{showcategories}) {
    # get the list of categories from the server.
    my $cats = $mt->mt()->getCategoryList( )
        or die "[err] ", $mt->LastError( ), "\n";
    # and print 'em.
    if (scalar(@$cats) > 0) {
        print "The following blog categories are available:\n\n";
        foreach (sort { $a->{categoryId} <=> $b->{categoryId} } @$cats) {
            print " $_->{categoryId}: $_->{categoryName}\n";
        }
    } else { print "There are no selectable categories available.\n"; }
    # done with this request, so exit.
    print "\nCategory ID's can be used for --catid or -c.\n";
    print "-" x 76, "\n"; exit; # call me again, again!
}
# now, check for passed URLs for new-item-examination.
die "[err] No RSS URLs were passed for processing.\n" unless @ARGV;
# and store today's date for comparison.
# who needs the stinkin' Date:: modules?!
my ($day, $month, $year) = ((localtime)[3, 4, 5]);
$year+=1900; $month = sprintf("%02d", ++$month);
$day = sprintf("%02d", $day); # zero-padding.
my $today = "$year-$month-$day"; # final version.
# loop through each RSS URL.
foreach my $rss_url (@ARGV) {
    # download whatever we've got coming.
    print "Downloading RSS feed at ", substr($rss_url, 0, 40), "...\n";
    my $data = get($rss_url) or print " [err] Data not downloaded!\n";
    next unless $data; # move onto the next URL in our list, if any.
    # parse it and then
    # count the number of items.
    # move on if nothing parsed.
    my $rss = new XML::RSS; $rss->parse($data);
    my $item_count = scalar(@{$rss->{items}});
    unless ($item_count) { print " [err] No parsable items.\n"; next; }
    # sandwich our post between a preface/anteface.
    my $clink = $rss->{channel}->{"link"}; # shorter variable.
    my $ctitle = $rss->{channel}->{title}; # shorter variable.
    my $preface = "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";
    my $anteface = "</blockquote>\n\n"; # new items as quotes.
    # and look for items dated today.
    foreach my $item (@{$rss->{items}}) {
        # no description or date for our item? move on.
        unless ($item->{description} and $item->{dc}->{date}) {
            print " Skipping (no description/date): '$item->{title}'.\n";
            next;
        }
        # if we have a date, is it today's?
        if ($item->{dc}->{date} =~ /^$today/) {
            # shorter variable. we're lazy.
            my $creator = $item->{dc}->{creator};
            # if there's a filter, check for goodness.
            if ($opts{filter} && $item->{description} !~ /$opts{filter}/i) {
                print " Skipping (failed filter): '$item->{title}'.\n";
                next;
            }
            # we found an item to post, so make a
            # final description from various parts.
            my $description = "$preface$item->{description} ";
            $description .= "($creator) " if $creator;
            $description .= "<a href=\"$item->{link}\">Read " .
                            "more from this post.</a>$anteface";
            # now, post to the passed blog info.
            print " Publishing item: '$item->{title}'.\n";
            my $id = $mt->metaWeblog( )->newPost(
                         title => $item->{title},
                         description => $description,
                         publish => 1)
                or die "[err] ", $mt->LastError( ), "\n";
            # set the category?
            if ($opts{catid}) {
                $mt->mt( )->setPostCategories(
                         postid => $id,
                         categories => [ {categoryId => $opts{catid}}])
                    or die " [err] ", $mt->LastError( ), "\n";
                # "edit" the post with no changes so
                # that our category change activates.
                $mt->metaWeblog( )->editPost(
                         title => $item->{title},
                         description => $description,
                         postid => $id,
                         publish => 1)
                    or die " [err] ", $mt->LastError( ), "\n";
            }
        } else {
            print " Skipping (failed date check): '$item->{title}'.\n";
        }
    }
    print "-" x 76, "\n"; # visual separator.
}
exit;
Aggregating only to Individual Archive Template
2005-03-22 15:07:28
tkeating
I've been testing the O'Reilly myrssmerger.pl script (which calls Net::Blogger), which takes an RSS feed and automatically turns it into Movable Type blog posts.
It works great except for one thing. When I aggregate a large RSS file containing lots of entries, it fills my Main Movable Type template with outside news stories. I don't want to flood my readers with tons of outside stories on my blog's home page.
I'd rather publish aggregated RSS content just to the Individual Archives (not the Main Template) and then reference/link to interesting material when I blog throughout the day.
I notice in the Net::Blogger description there is a set template parameter, but I'm not sure where to integrate that into the myrssmerger.pl script code. Suggestions?
P.S. One last thing: is there a way of saving an RSS-to-MT blog post as 'Draft' instead of 'Publish'? It would allow me to delete posts I deem unimportant before publishing. I tried changing the publish parameter from "1" to "0" but that had no effect.