Scrape Google News Search results to get at the latest from thousands of aggregated news sources.
Google News, with its thousands of news sources worldwide, is a veritable treasure trove for any news hound. However, because you can’t access Google News through the Google API [Chapter 9], you’ll have to scrape your results from the HTML of a Google News results page. This hack does just that, gathering up results into a comma-delimited file suitable for loading into a spreadsheet or database. For each news story, it extracts the title, URL, source (i.e., news agency), publication date or age of the news item, and an excerpted description.
Because Google’s Terms of Service prohibit the automated access of their search engines except through the Google API, this hack does not actually connect to Google. Instead, it works on a page of results that you’ve saved from a Google News Search that you’ve run yourself. Simply save the results page as HTML source using your browser’s File→Save As... command.
Make sure the results are listed by date instead of relevance. When
results are listed by relevance, some of the descriptions are missing
because similar stories are clumped together. You can sort results by
date by choosing the “Sort by date”
link on the results page or by adding
&scoring=d
to the end of the results URL.
Also, make sure you’re getting the maximum number of
results by adding &num=100
to the end of the
results URL. For example, Figure 4-3 shows results
of a query for election 2004
, the latest on the
2004 United States Presidential Election—something of great
import at the time of this writing.
Warning
Bear in mind that scraping web pages is a brittle occupation. A single change in the HTML code underlying the standard Google News results page and you’re more than likely to end up with no results.
At the time of this writing, a typical Google News Search result looks a little something like this:
<a href="http://www.townhall.com/columnists/phyllisschlafly/ps20041018.shtml" id=r-1> Show Me State debate spotlights conservative issues</a><br><font size=-1> <font color=#6f6f6f>Town Hall, DC -</font> <nobr>8 minutes ago</nobr></font><br><font size=-1><b>...</b> The <b>2004</b> <b>election</b> presents a stark choice to voters on many issues. The second debate highlighted those differences on taxes, sovereignty <b>...</b> </font><br>
While for most of you this is utter gobbledygook, it is probably of some use to those trying to understand the regular expression matching in the following script.
Save this code to a file called news2csv.pl
:
#!/usr/bin/perl # news2csv.pl # Google News Results exported to CSV suitable for import into Excel. # Usage: perl news2csv.pl < news.html > news.csv print qq{"title","link","source","date or age", "description"\n}; my %unescape = ('<'=>'<', '>'=>'>', '&'=>'&', '"'=>'"', ' '=>' '); my $unescape_re = join '|' => keys %unescape; my $results = join '', <>; $results =~ s/($unescape_re)/$unescape{$1}/migs; # unescape HTML $results =~ s![\n\r]! !migs; # drop spurious newlines while ( $results =~ m!<a href="([^"]+)" id="?r-[0-9]+"?>(.+?)</a> <br>(.+?)<nobr>(.+?)</nobr>.*?<br>.+?)<br>migs { my($url, $title, $source, $date_age, $description) = ($1||'',$2||'',$3||'',$4||'', $5||''); $title =~ s!"!""!g; # double escape " marks $description =~ s!"!""!g; my $output = qq{"$title","$url","$source","$date_age","$description"\n}; $output =~ s!<.+?>!!g; # drop all HTML tags print $output; }
Run the script from the command line ["Running the
Hacks” in the the Preface], specifying the Google
News results HTML filename and the name of the CSV file that you wish
to create or to which you wish to append additional results. For
example, using news.html
as our input and
news.csv
as our output:
$perl news2csv.pl <
>
news.csv
Leaving off the >
and CSV filename sends the
results to the screen for your perusal.
The following are some of the 20,300 results returned by a Google
News Search for election 2004
and using the HTML
page of results shown in Figure 4-3:
$ perl news2csv.pl < news.html
"title","link","source","date or age", "description"
"Bush and Kerry on the razor's edge","http://www.dallasnews.com/sharedcontent/dws/dn/
opinion/columnists/mdavis/stories/101304dnedidavis.97af9.html","Dallas Morning News
(subscription),<A0>TX<A0>- ","12 minutes ago","... But just as the election
is a referendum on Mr. Bush, debate analysis depends on ... Tonight history will write
whether the Bush debate grade for 2004 is generally ... "
"MINNESOTA: Notes and quotes from battleground Minnesota in ...",
"http://www.twincities.com
/mld/twincities/news/state/minnesota/9902651.htm","Pioneer Press (subscription),<A0>
MN<A0>- ","16 minutes ago","... the state. Both sides had assumed Minnesota would
easily go Democratic, as it had for every presidential election since 1972. In ... "
"MINNESOTA: Notes and quotes from battleground Minnesota in ...","http://www.
duluthsuperior.com/mld/duluthsuperior/news/politics/9902651.htm","Duluth News Tribune,<
A0>MN<A0>- ","19 minutes ago","... the state. Both sides had assumed Minnesota
would easily go Democratic, as it had for every presidential election since 1972. In ... "
...
You’ll want to leave most of the
news2csv.pl
script alone, since
it’s been built to make sense out of the Google News
formatting. If you don’t like the way the program
organizes the information that’s taken out of the
results page, you can change it. Just rearrange the variables on the
following line, sorting them in any way that you choose. Be sure to
keep a comma between each one.
my $output = qq{"$title","$url","$source","$date_age","$description"\n};
For example, perhaps you want only the URL and title. The line should read:
my $output = qq{"$url","$title"\n};
That \n
specifies a new line, and the
$
characters specify that $url
and $title
are variable names; keep them intact.
Of course, now your output won’t match the header at the top of the CSV file, by default:
print qq{"title","link","source","date or age", "description"\n};
As before, simply change this to match, as follows:
print qq{"url","title"\n};
Get Google Hacks, 2nd Edition now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.