Chapter 4. Mining Google+: Computing Document Similarity, Extracting Collocations, and More
This chapter introduces some fundamental concepts from text mining[13] and is somewhat of an inflection point in this book. Whereas we started the book with basic frequency analyses of Twitter data and gradually worked up to more sophisticated clustering analyses of messier data from LinkedIn profiles, this chapter begins munging and making sense of textual information in documents by introducing information retrieval (IR) theory fundamentals such as TF-IDF, cosine similarity, and collocation detection. Accordingly, its content is a bit more complex than that of the chapters before it, and it may be helpful to have worked through those chapters before picking up here.
Google+ initially serves as our primary source of data for this chapter because it’s inherently social, features content that’s often expressed as longer-form notes that resemble blog entries, and is now an established staple in the social web. It’s not hard to make a case that Google+ activities (notes) fill a niche somewhere between Twitter and blogs. The concepts from previous chapters could equally be applied to the data behind Google+’s diverse and powerful API, although we’ll opt to leave regurgitations of those examples (mostly) as exercises for the motivated reader.
Wherever possible we won’t reinvent the wheel and implement analysis tools from scratch, but we will take a couple of “deep dives” when particularly foundational topics come up that are essential to an understanding of text mining. The Natural Language Toolkit (NLTK) is a powerful technology that you may recall from Chapter 3; it provides many of the tools we’ll use in this chapter. Its rich suites of APIs can be a bit overwhelming at first, but don’t worry: while text analytics is an incredibly diverse and complex field of study, there are lots of powerful fundamentals that can take you a long way without too significant an investment. This chapter and the chapters after it aim to home in on those fundamentals. (A full-blown introduction to NLTK is outside the scope of this book, but you can review the full text of Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit [O’Reilly] at the NLTK website.)
Note
Always get the latest bug-fixed source code for this chapter (and every other chapter) online at http://bit.ly/MiningTheSocialWeb2E. Be sure to also take advantage of this book’s virtual machine experience, as described in Appendix A, to maximize your enjoyment of the sample code.
Overview
This chapter uses Google+ to begin our journey in analyzing human language data. In this chapter you’ll learn about:
The Google+ API and making API requests
TF-IDF (Term Frequency–Inverse Document Frequency), a fundamental technique for analyzing words in documents
How to apply NLTK to the problem of understanding human language
How to apply cosine similarity to common problems such as querying documents by keyword
How to extract meaningful phrases from human language data by detecting collocation patterns
Exploring the Google+ API
Anyone with a Gmail account can trivially create a Google+ account and start collaborating with friends. From a product standpoint, Google+ has evolved rapidly and used some of the most compelling features of existing social network platforms such as Twitter and Facebook in carving out its own set of unique capabilities. As with the other social websites featured in this book, a full overview of Google+ isn’t in scope for this chapter, but you can easily read about it (or sign up) online. The best way to learn is by creating an account and spending some time exploring. For the most part, where there’s a feature in the user interface, there’s an API that provides that feature that you can tap into. Suffice it to say that Google+ has leveraged tried-and-true features of existing social networks, such as marking content with hashtags and maintaining a profile according to customizable privacy settings, with additional novelties such as a fresh take on content sharing called circles, video chats called hangouts, and extensive integration with other Google services such as Gmail contacts. In Google+ API parlance, social interactions are framed in terms of people, activities, comments, and moments.
The API documentation that’s available online is always the definitive source of guidance, but a brief overview may be helpful to get you thinking about how Google+ compares to another platform such as Twitter or Facebook:
- People
People are Google+ users. Programmatically, you’ll either discover users by using the search API, look them up by a personalized URL if they’re a celebrity type,[14] or strip their Google+ IDs out of the URLs that appear in your web browser and use them for exploring their profiles. Later in this chapter, we’ll use the search API to find users on Google+.
- Activities
Activities are the things that people do on Google+. An activity is essentially a note and can be as long or short as the author likes: it can be as long as a blog post, or it can be devoid of any real textual meaning (e.g., just used to share links or multimedia content). Given a Google+ user, you can easily retrieve a list of that person’s activities, and we’ll do that later in this chapter. Like a tweet, an activity contains a lot of fascinating metadata , such as the number of times the activity has been reshared.
- Comments
Leaving comments is the way Google+ users interact with one another. Simple statistical analysis of comments on Google+ could be very interesting and potentially reveal a lot of insights into a person’s social circles or the virality of content. For example, which other Google+ users most frequently comment on activities? Which activities have the highest numbers of comments (and why)?
- Moments
Moments are a relatively recent addition to Google+ and represent a way of capturing interactions between a user and a Google+ application. Moments are similar to Facebook’s social graph stories in that they are designed to capture and create opportunities for user interaction with an application that can be displayed on a timeline. For example, if you were to make a purchase in an application, upload a photo, or watch a YouTube video, it could be captured as a moment (something you did in time) and displayed in a history of your actions or shared with friends in an activity stream by the application.
For the purposes of this chapter, we’ll be focusing on harvesting and analyzing Google+ activity data that is textual and intended to convey the same kind of meaning that you might encounter in a tweet, blog post, or Facebook status update. In other words, we’ll be trying to analyze human language data. If you haven’t signed up for Google+ yet, it’s worth taking the time to do so, as well as spending a few moments to familiarize yourself with a Google+ profile. One of the easiest ways to find specific users on Google+ is to just search for them at http://plus.google.com, and explore the platform from there.
As you explore Google+, bear in mind that it has a unique set of capabilities and is a little awkward to compare directly to other social web properties. If you’re expecting it to be a straightforward comparison, you might find yourself a bit surprised. It’s similar to Twitter in that it provides a “following” model where you can add individuals to one of your Google+ circles and keep up with their happenings without any approval on their part, but the Google+ platform also offers rich integration with other Google web properties, sports a powerful feature for videoconferencing (hangouts), and has an API similar to Facebook’s in the way that users share content and interact. Without further ado, let’s get busy exploring the API and work toward mining some data.
Making Google+ API Requests
From a software development standpoint, Google+ leverages OAuth like the rest of the social web to enable an application that you’ll build to access data on a user’s behalf, so you’ll need to register an application to get appropriate credentials for accessing the Google+ platform. The Google API Console provides a means of registering an application (called a project in the Google API Console) to get OAuth credentials but also exposes an API key that you can use for “simple API access.” This API key is what we’ll use in this chapter to programmatically access the Google+ platform and just about every other Google service.
Once you’ve created an application, you’ll also need to specifically enable it to use Google+ as a separate step. Figure 4-1 provides a screenshot of the Google+ API Console as well as a view that demonstrates what it looks like when you have enabled your application for Google+ API access.
You can install a Python package called google-api-python-client for accessing Google’s API via pip install google-api-python-client. This is one of the standard Python-based options for accessing Google+ data. The online documentation for google-api-python-client is marginally helpful in familiarizing yourself with the capabilities of what it offers, but in general, you’ll just be plugging parameters from the official Google+ API documents into some predictable access patterns with the Python package. Once you’ve walked through a couple of exercises, it’s a relatively straightforward process.
Note
Don’t forget that pydoc can be helpful for gathering clues about a package, class, or method in a terminal as you are learning it. The help function in a standard Python interpreter is also useful. Recall that appending ? to a method name in IPython is a shortcut for displaying its docstring.
As an initial exercise, let’s consider the problem of locating a person on Google+. Like any other social web API, the Google+ API offers a means of searching, and in particular we’ll be interested in the People: search API. Example 4-1 illustrates how to search for a person with the Google+ API. Since Tim O’Reilly is a well-known personality with an active and compelling Google+ account, let’s look him up.
The basic pattern that you’ll repeatedly use with the Python client is to create an instance of a service that’s parameterized for Google+ access with your API key, and then invoke particular platform services on that instance. Here, we create a connection to the People API by invoking service.people() and then chaining on some additional API operations deduced from reviewing the API documentation online. In a moment we’ll query for activity data, and you’ll see that the same basic pattern holds.
import httplib2
import json
import apiclient.discovery # pip install google-api-python-client

# XXX: Enter any person's name
Q = "Tim O'Reilly"

# XXX: Enter in your API key from https://code.google.com/apis/console
API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

people_feed = service.people().search(query=Q).execute()

print json.dumps(people_feed['items'], indent=1)
Following are sample results for searching for Tim O’Reilly:
[
 {
  "kind": "plus#person",
  "displayName": "Tim O'Reilly",
  "url": "https://plus.google.com/+TimOReilly",
  "image": {
   "url": "https://lh4.googleusercontent.com/-J8..."
  },
  "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/wwgOCMn...\"",
  "id": "107033731246200681024",
  "objectType": "person"
 },
 {
  "kind": "plus#person",
  "displayName": "Tim O'Reilly",
  "url": "https://plus.google.com/11566571170551...",
  "image": {
   "url": "https://lh3.googleusercontent.com/-yka..."
  },
  "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/0z-EwRK7...\"",
  "id": "115665711705516993369",
  "objectType": "person"
 },
 ...
]
The results do indeed return a list of people named Tim O’Reilly, but how can we tell which one of these results refers to the well-known Tim O’Reilly of technology fame that we are looking for? One option would be to request profile or activity information for each of these results and try to disambiguate them manually. Another option is to render the avatars included in each of the results, which is trivial to do by rendering the avatars as images within IPython Notebook. Example 4-2 illustrates how to display avatars and the corresponding ID values for each search result by generating HTML and rendering it inline as a result in the notebook.
from IPython.core.display import HTML

html = []

for p in people_feed['items']:
    html += ['<p><img src="%s" /> %s: %s</p>' % \
             (p['image']['url'], p['id'], p['displayName'])]

HTML(''.join(html))
Sample results are displayed in Figure 4-2 and provide the “quick fix” that we’re looking for in our search for the particular Tim O’Reilly of O’Reilly Media.
Although there’s a multiplicity of things we could do with the People API, our focus in this chapter is on an analysis of the textual content in accounts, so let’s turn our attention to the task of retrieving activities associated with this account. As you’re about to find out, Google+ activities are the linchpin of Google+ content, containing a variety of rich content associated with the account and providing logical pivots to other platform objects such as comments. To get some activities, we’ll need to tweak the design pattern we applied for searching for people, as illustrated in Example 4-3.
import httplib2
import json
import apiclient.discovery

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from https://code.google.com/apis/console
# if not currently set
# API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per API
).execute()

print json.dumps(activity_feed, indent=1)
Sample results for the first item in the results (activity_feed['items'][0]) follow and illustrate the basic nature of a Google+ activity:
{
 "kind": "plus#activity",
 "provider": {
  "title": "Google+"
 },
 "title": "This is the best piece about privacy that I've read in a ...",
 "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
 "object": {
  "resharers": {
   "totalItems": 191,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvy..."
  },
  "attachments": [
   {
    "content": "Many governments (including our own, here in the US) ...",
    "url": "http://www.zdziarski.com/blog/?p=2155",
    "displayName": "On Expectation of Privacy | Jonathan Zdziarski's Domain",
    "objectType": "article"
   }
  ],
  "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
  "content": "This is the best piece about privacy that I've read ...",
  "plusoners": {
   "totalItems": 356,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvyid..."
  },
  "replies": {
   "totalItems": 48,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvyid..."
  },
  "objectType": "note"
 },
 "updated": "2013-04-25T14:46:16.908Z",
 "actor": {
  "url": "https://plus.google.com/107033731246200681024",
  "image": {
   "url": "https://lh4.googleusercontent.com/-J8nmMwIhpiA/AAAAAAAAAAI/A..."
  },
  "displayName": "Tim O'Reilly",
  "id": "107033731246200681024"
 },
 "access": {
  "items": [
   {
    "type": "public"
   }
  ],
  "kind": "plus#acl",
  "description": "Public"
 },
 "verb": "post",
 "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/d-ppAzuVZpXrW_YeLXc5ctstsCM\"",
 "published": "2013-04-25T14:46:16.908Z",
 "id": "z125xvyidpqjdtol423gcxizetybvpydh"
}
Each activity object follows a three-tuple pattern of the form (actor, verb, object). In this post, the tuple (Tim O’Reilly, post, note) tells us that this particular item in the results is a note, which is essentially just a status update with some textual content. A closer look at the result reveals that the content is something that Tim O’Reilly feels strongly about as indicated by the title “This is the best piece about privacy that I’ve read in a long time!” and hints that the note is active as evidenced by the number of reshares and comments.
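As a quick sketch of that three-tuple, the pieces can be pulled straight out of the parsed JSON. The field names below come directly from the sample result above; it’s assumed that the activity_feed dictionary returned in Example 4-3 is still in scope.

# A minimal sketch: extract the (actor, verb, object) triple from one activity.
# Assumes activity_feed is the executed result from Example 4-3.
activity = activity_feed['items'][0]
triple = (activity['actor']['displayName'],
          activity['verb'],
          activity['object']['objectType'])
print triple  # e.g. ("Tim O'Reilly", 'post', 'note')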
If you reviewed the output carefully, you may have noticed that the content field for the activity contains HTML markup, as evidenced by the escaped apostrophe entity in I&#39;ve. In general, you should assume that the textual data exposed as Google+ activities contains some basic markup, such as <br /> tags and escaped HTML entities for apostrophes, so as a best practice we need to do a little bit of additional filtering to clean it up. Example 4-4 provides an example of how to distill plain text from the content field of a note by introducing a function called cleanHtml. It takes advantage of a clean_html function provided by NLTK and another handy package for manipulating HTML, called BeautifulSoup, that converts HTML entities back to plain text. If you haven’t already encountered BeautifulSoup, it’s a package that you won’t want to live without once you’ve added it to your toolbox: it has the ability to process HTML in a reasonable way even if it is invalid and violates standards or other reasonable expectations (à la web data). You should install these packages via pip install nltk beautifulsoup4 if you haven’t already.
from nltk import clean_html
from BeautifulSoup import BeautifulStoneSoup

# clean_html removes tags and
# BeautifulStoneSoup converts HTML entities

def cleanHtml(html):
    if html == "": return ""

    return BeautifulStoneSoup(clean_html(html),
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

print activity_feed['items'][0]['object']['content']
print
print cleanHtml(activity_feed['items'][0]['object']['content'])
The output from the note’s content, once cleansed with cleanHtml, is very clean text that can be processed without additional concerns about noise in the content. As we’ll learn in this chapter and follow-on chapters about text mining, reduction of noise in text content is a critical aspect of improving accuracy. The before and after content follows.
Here’s the raw content in activity_feed['items'][0]['object']['content']:
This is the best piece about privacy that I&#39;ve read in a long time! If it doesn&#39;t change how you think about the privacy issue, I&#39;ll be surprised. It opens:<br /><br />"Many governments (including our own, here in the US) would have its citizens believe that privacy is a switch (that is, you either reasonably expect it, or you don’t). This has been demonstrated in many legal tests, and abused in many circumstances ranging from spying on electronic mail, to drones in our airspace monitoring the movements of private citizens. But privacy doesn’t work like a switch – at least it shouldn’t for a country that recognizes that privacy is an inherent right. In fact, privacy, like other components to security, works in layers..."<br /><br />Please read!
And here’s the content rendered after cleansing with cleanHtml(activity_feed['items'][0]['object']['content']):
This is the best piece about privacy that I've read in a long time! If it doesn't change how you think about the privacy issue, I'll be surprised. It opens: "Many governments (including our own, here in the US) would have its citizens believe that privacy is a switch (that is, you either reasonably expect it, or you don’t). This has been demonstrated in many legal tests, and abused in many circumstances ranging from spying on electronic mail, to drones in our airspace monitoring the movements of private citizens. But privacy doesn’t work like a switch – at least it shouldn’t for a country that recognizes that privacy is an inherent right. In fact, privacy, like other components to security, works in layers..." Please read!
The ability to query out clean text from Google+ is the basis for the remainder of the text mining exercises in this chapter, but one additional consideration that you may find useful before we focus our attention elsewhere is a pattern for fetching multiple pages of content.
Whereas the previous example fetched 100 activities, the maximum number of results for a query, it may be the case that you’ll want to iterate over an activities feed and retrieve more than the maximum number of activities per page. The pattern for pagination is outlined in the HTTP API Overview, and the Python client wrapper takes care of most of the hassle.
Example 4-5 shows how to fetch multiple pages of activities and distill the text from them if they are notes and have meaningful content.
import os
import httplib2
import json
import apiclient.discovery
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from https://code.google.com/apis/console
# if not currently set
# API_KEY = ''

MAX_RESULTS = 200 # Will require multiple requests

def cleanHtml(html):
    if html == "": return ""

    return BeautifulStoneSoup(clean_html(html),
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per request
)

activity_results = []

while activity_feed != None and len(activity_results) < MAX_RESULTS:

    activities = activity_feed.execute()

    if 'items' in activities:

        for activity in activities['items']:

            if activity['object']['objectType'] == 'note' and \
               activity['object']['content'] != '':

                activity['title'] = cleanHtml(activity['title'])
                activity['object']['content'] = cleanHtml(activity['object']['content'])
                activity_results += [activity]

    # list_next requires the previous request and response objects
    activity_feed = service.activities().list_next(activity_feed, activities)

# Write the output to a file for convenience

f = open(os.path.join('resources', 'ch04-googleplus', USER_ID + '.json'), 'w')
f.write(json.dumps(activity_results, indent=1))
f.close()

print str(len(activity_results)), "activities written to", f.name
With the know-how to explore the Google+ API and fetch some interesting human language data from activities’ content, let’s now turn our attention to the problem of analyzing the content.
A Whiz-Bang Introduction to TF-IDF
Although rigorous approaches to natural language processing (NLP) that include such things as sentence segmentation, tokenization, word chunking, and entity detection are necessary in order to achieve the deepest possible understanding of textual data, it’s helpful to first introduce some fundamentals from information retrieval theory. The remainder of this chapter introduces some of its more foundational aspects, including TF-IDF, the cosine similarity metric, and some of the theory behind collocation detection. Chapter 5 provides a deeper discussion of NLP as a logical continuation of this discussion.
Note
If you want to dig deeper into IR theory, the full text of Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze’s Introduction to Information Retrieval (Cambridge University Press) is available online and provides more information than you could (probably) ever want to know about the field.
Information retrieval is an extensive field with many specialties. This discussion narrows in on TF-IDF, one of the most fundamental techniques for retrieving relevant documents from a corpus (collection). TF-IDF stands for term frequency-inverse document frequency and can be used to query a corpus by calculating normalized scores that express the relative importance of terms in the documents.
Mathematically, TF-IDF is expressed as the product of the term frequency and the inverse document frequency, tf_idf = tf*idf, where the term tf represents the importance of a term in a specific document, and idf represents the importance of a term relative to the entire corpus. Multiplying these terms together produces a score that accounts for both factors and has been an integral part of every major search engine at some point in its existence. To get a more intuitive idea of how TF-IDF works, let’s walk through each of the calculations involved in computing the overall score.
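For reference, the specific formulation used by the naive implementation later in this chapter (Example 4-7) is the following, where d is a single document, D is the corpus, and the added 1 in the idf term is the smoothing adjustment discussed at the end of this section:

\mathrm{tf}(t, d) = \frac{\text{number of occurrences of } t \text{ in } d}{\text{total number of terms in } d}

\mathrm{idf}(t, D) = 1 + \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right)

\mathrm{tf\_idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)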
Term Frequency
For simplicity in illustration, suppose you have a corpus containing three sample documents and terms are calculated by simply breaking on whitespace, as illustrated in Example 4-6 as ordinary Python code.
corpus = {
 'a' : "Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.",
 'b' : "Professor Plum has a green plant in his study.",
 'c' : "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."
}

terms = {
 'a' : [ i.lower() for i in corpus['a'].split() ],
 'b' : [ i.lower() for i in corpus['b'].split() ],
 'c' : [ i.lower() for i in corpus['c'].split() ]
}
A term’s frequency could simply be represented as the number of times it occurs in the text, but it is more commonly the case that you normalize it by taking into account the total number of terms in the text, so that the overall score accounts for document length relative to a term’s frequency. For example, “green” (once normalized to lowercase) occurs twice in corpus['a'] and only once in corpus['b'], so corpus['a'] would produce a higher score if frequency were the only scoring criterion. However, if you normalize for document length, corpus['b'] would have a slightly higher term frequency score for “green” (1/9) than corpus['a'] (2/19), because corpus['b'] is shorter than corpus['a'], even though “green” occurs more frequently in corpus['a']. A common technique for scoring a compound query such as “Mr. Green” is to sum the term frequency scores for each of the query terms in each document, and return the documents ranked by the summed term frequency score.
Let’s illustrate how term frequency works by querying our sample corpus for “Mr. Green,” which would produce the normalized scores reported in Table 4-1 for each document.
Document | tf(mr.) | tf(green) | Sum |
corpus['a'] | 2/19 | 2/19 | 4/19 (0.2105) |
corpus['b'] | 0 | 1/9 | 1/9 (0.1111) |
corpus['c'] | 0 | 1/16 | 1/16 (0.0625) |
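The numbers in Table 4-1 are easy to reproduce by hand from the terms dictionary built in Example 4-6; the short sketch below assumes that dictionary is still in scope and simply normalizes raw counts by document length.

# A minimal sketch that reproduces the normalized term frequencies in Table 4-1.
# Assumes the 'terms' dictionary from Example 4-6 is already defined.
query = ['mr.', 'green']
for doc_id in sorted(terms):
    doc = terms[doc_id]
    scores = [doc.count(q) / float(len(doc)) for q in query]
    print doc_id, scores, 'sum:', sum(scores)  # e.g. 'a' sums to 4/19 ~= 0.2105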
For this contrived example, a cumulative term frequency scoring scheme works out and returns corpus['a'] (the document that we’d expect it to return), since corpus['a'] is the only one that contains the compound token “Mr. Green.” However, a number of problems could have emerged, because the term frequency scoring model looks at each document as an unordered collection of words. For example, queries for “Green Mr.” or “Green Mr. Foo” would have returned the exact same scores as the query for “Mr. Green,” even though neither of those phrases appears in the sample sentences. Additionally, there are a number of scenarios that we could easily contrive to illustrate fairly poor results from the term frequency ranking technique, given that trailing punctuation is not handled properly and that the context around tokens of interest is not taken into account by the calculations.
Considering term frequency alone turns out to be a common source of problems when scoring on a document-by-document basis, because it doesn’t account for very frequent words, called stopwords,[15] that are common across many documents. In other words, all terms are weighted equally, regardless of their actual importance. For example, “the green plant” contains the stopword “the,” which skews overall term frequency scores in favor of corpus['a'] because “the” appears twice in that document, as does “green.” In contrast, in corpus['c'] “green” and “plant” each appear only once. Consequently, the scores would break down as shown in Table 4-2, with corpus['a'] ranked as more relevant than corpus['c'], even though intuition might lead you to believe that ideal query results probably shouldn’t have turned out that way. (Fortunately, however, corpus['b'] still ranks highest.)
Inverse Document Frequency
Toolkits such as NLTK provide lists of stopwords that can be used to filter out terms such as and, a, and the, but keep in mind that there may be terms that evade even the best stopword lists and yet still are quite common to specialized domains. Although you can certainly customize a list of stopwords with domain knowledge, the inverse document frequency metric is a calculation that provides a generic normalization metric for a corpus. It works in the general case by accounting for the appearance of common terms across a set of documents by considering the total number of documents in which a query term ever appears.
The intuition behind this metric is that it produces a higher value if a term is somewhat uncommon across the corpus than if it is common, which helps to account for the problem with stopwords we just investigated. For example, a query for “green” in the corpus of sample documents should return a lower inverse document frequency score than a query for “candlestick,” because “green” appears in every document while “candlestick” appears in only one. Mathematically, the only nuance of interest for the inverse document frequency calculation is that a logarithm is used to reduce the result into a compressed range, since its usual application is in multiplying it against term frequency as a scaling factor. For reference, a logarithm function is shown in Figure 4-3; as you can see, the logarithm function grows very slowly as values for its domain increase, effectively “squashing” its input.
Table 4-3 provides inverse document frequency scores that correspond to the term frequency scores for the terms in the previous section. Example 4-7 in the next section presents source code that shows how to compute these scores. In the meantime, you can notionally think of the IDF score for a term as the logarithm of a quotient that is defined by the number of documents in the corpus divided by the number of texts in the corpus that contain the term. When viewing these tables, keep in mind that whereas a term frequency score is calculated on a per-document basis, an inverse document frequency score is computed on the basis of the entire corpus. Hopefully this makes sense given that its purpose is to act as a normalizer for common words across the entire corpus.
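A couple of concrete values help the intuition stick. The sketch below assumes the corpus dictionary from Example 4-6 is in scope and uses the same 1 + log(...) formulation that Example 4-7 uses; note that with naive whitespace tokenization the token is 'candlestick.' with its trailing period attached.

from math import log

# A minimal sketch of inverse document frequency for the sample corpus from
# Example 4-6 (assumed to be in scope). Mirrors the idf function in Example 4-7.
def sketch_idf(term, corpus):
    num_texts_with_term = len([True for text in corpus.values()
                               if term.lower() in text.lower().split()])
    return 1.0 + log(float(len(corpus)) / num_texts_with_term)

print sketch_idf('green', corpus)         # 1.0   -- appears in all three documents
print sketch_idf('candlestick.', corpus)  # ~2.099 -- appears in only one document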
TF-IDF
At this point, we’ve come full circle and devised a way to compute a score for a multiterm query that accounts for the frequency of terms appearing in a document, the length of the document in which any particular term appears, and the overall uniqueness of the terms across documents in the entire corpus. We can combine the concepts behind term frequency and inverse document frequency into a single score by multiplying them together, so that TF–IDF = TF*IDF. Example 4-7 is a naive implementation of this discussion that should help solidify the concepts described. Take a moment to review it, and then we’ll discuss a few sample queries.
from math import log

# XXX: Enter in a query term from the corpus variable
QUERY_TERMS = ['mr.', 'green']

def tf(term, doc, normalize=True):
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return doc.count(term.lower()) / 1.0

def idf(term, corpus):
    num_texts_with_term = len([True for text in corpus
                               if term.lower() in text.lower().split()])

    # tf-idf calc involves multiplying against a tf value less than 1, so it's
    # necessary to return a value greater than 1 for consistent scoring.
    # (Multiplying two values less than 1 returns a value less than each of
    # them.)
    try:
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        return 1.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = \
    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.',
     'b': 'Professor Plum has a green plant in his study.',
     'c': "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."}

for (k, v) in sorted(corpus.items()):
    print k, ':', v
print

# Score queries by calculating cumulative tf_idf score for each term in query
query_scores = {'a': 0, 'b': 0, 'c': 0}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc])
    print 'IDF: %s' % (term, ), idf(term, corpus.values())
    print

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print 'TF-IDF(%s): %s' % (doc, term), score
        query_scores[doc] += score
    print

print "Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), )
for (doc, score) in sorted(query_scores.items()):
    print doc, score
a : Mr. Green killed Colonel Mustard in the study...
b : Professor Plum has a green plant in his study.
c : Miss Scarlett watered Professor Plum's green...

TF(a): mr. 0.105263157895
TF(b): mr. 0.0
TF(c): mr. 0.0
IDF: mr. 2.09861228867

TF-IDF(a): mr. 0.220906556702
TF-IDF(b): mr. 0.0
TF-IDF(c): mr. 0.0

TF(a): green 0.105263157895
TF(b): green 0.111111111111
TF(c): green 0.0625
IDF: green 1.0

TF-IDF(a): green 0.105263157895
TF-IDF(b): green 0.111111111111
TF-IDF(c): green 0.0625

Overall TF-IDF scores for query 'mr. green'
a 0.326169714597
b 0.111111111111
c 0.0625
Although we’re working on a trivially small scale, the calculations involved work the same for larger data sets. Table 4-4 is a consolidated adaptation of the program’s output for three sample queries that involve four distinct terms:
“green”
“mr. green”
“the green plant”
Even though the IDF calculations for terms are computed on the basis of the entire corpus, they are displayed on a per-document basis so that you can easily verify TF-IDF scores by skimming a single row and multiplying two numbers. As you work through the query results, you’ll find that it’s remarkable just how powerful TF-IDF is, given that it doesn’t account for the proximity or ordering of words in a document.
Document | tf(mr.) | tf(green) | tf(the) | tf(plant) |
corpus['a'] | 0.1053 | 0.1053 | 0.1053 | 0 |
corpus['b'] | 0 | 0.1111 | 0 | 0.1111 |
corpus['c'] | 0 | 0.0625 | 0 | 0.0625 |
idf(mr.) | idf(green) | idf(the) | idf(plant) | |
2.0986 | 1.0 | 2.099 | 1.4055 |
tf-idf(mr.) | tf-idf(green) | tf-idf(the) | tf-idf(plant) | |
corpus['a'] | 0.1053*2.0986 = 0.2209 | 0.1053*1.0 = 0.1053 | 0.1053*2.099 = 0.2209 | 0*1.4055 = 0 |
corpus['b'] | 0*2.0986 = 0 | 0.1111*1.0 = 0.1111 | 0*2.099 = 0 | 0.1111*1.4055 = 0.1562 |
corpus['c'] | 0*2.0986 = 0 | 0.0625*1.0 = 0.0625 | 0*2.099 = 0 | 0.0625*1.4055 = 0.0878 |
The same results for each query are shown in Table 4-5, with the TF-IDF values summed on a per-document basis.
Query | corpus['a'] | corpus['b'] | corpus['c'] |
green | 0.1053 | 0.1111 | 0.0625 |
Mr. Green | 0.2209 + 0.1053 = 0.3262 | 0 + 0.1111 = 0.1111 | 0 + 0.0625 = 0.0625 |
the green plant | 0.2209 + 0.1053 + 0 = 0.3262 | 0 + 0.1111 + 0.1562 = 0.2673 | 0 + 0.0625 + 0.0878 = 0.1503 |
From a qualitative standpoint, the query results are quite reasonable. The corpus['b'] document is the winner for the query “green,” with corpus['a'] just a hair behind. In this case, the deciding factor was the length of corpus['b'] being much smaller than that of corpus['a']: the normalized TF score tipped the results in favor of corpus['b'] for its one occurrence of “green,” even though “Green” appeared in corpus['a'] two times. Since “green” appears in all three documents, the net effect of the IDF term in the calculations was a wash.
Do note, however, that if we had returned 0.0 instead of 1.0 for “green,” as is done in some IDF implementations, the TF-IDF scores for “green” would have been 0.0 for all three documents due to the effect of multiplying the TF score by zero. Depending on the particular situation, it may be better to return 0.0 for the IDF scores rather than 1.0. For example, if you had 100,000 documents and “green” appeared in all of them, you’d almost certainly consider it to be a stopword and want to remove its effects in a query entirely.
For the query “Mr. Green,” the clear and appropriate winner is the corpus['a'] document. However, this document also comes out on top for the query “the green plant.” A worthwhile exercise is to consider why corpus['a'] scored highest for this query as opposed to corpus['b'], which at first blush might have seemed a little more obvious.
A final nuanced point to observe is that the sample implementation provided in Example 4-7 adjusts the IDF score by adding a value of 1.0 to the logarithm calculation, for the purposes of illustration and because we’re dealing with a trivial document set. Without the 1.0 adjustment in the calculation, it would be possible to have the idf function return values that are less than 1.0, which would result in two fractions being multiplied in the TF-IDF calculation. Since multiplying two fractions together results in a value smaller than either of them, this turns out to be an easily overlooked edge case in the TF-IDF calculation. Recall that the intuition behind the TF-IDF calculation is that we’d like to be able to multiply two terms in a way that consistently produces larger TF-IDF scores for more relevant queries than for less relevant queries.
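To see the edge case concretely, consider what happens to the query “green” if the raw logarithm is used instead of the adjusted one. The numbers below are a quick back-of-the-envelope sketch using the counts from the sample corpus, not part of Example 4-7 itself.

from math import log

# A minimal sketch contrasting the adjusted and unadjusted IDF for "green",
# which appears in all three sample documents (so the raw log term is zero).
num_docs, docs_with_green = 3, 3
raw_idf = log(float(num_docs) / docs_with_green)             # 0.0
adjusted_idf = 1.0 + log(float(num_docs) / docs_with_green)  # 1.0

print raw_idf * (2.0 / 19)       # 0.0 -- "green" would contribute nothing to corpus['a']
print adjusted_idf * (2.0 / 19)  # ~0.105 -- matches the TF-IDF score shown in Table 4-4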
Querying Human Language Data with TF-IDF
Let’s take the theory that you just learned about in the previous section and put it to work. In this section, you’ll get officially introduced to NLTK, a powerful toolkit for processing natural language, and use it to support the analysis of human language data that we’ll fetch from Google+.
Introducing the Natural Language Toolkit
NLTK is written such that you can explore data easily and begin to form some impressions without a lot of upfront investment. Before skipping ahead, though, consider following along with the interpreter session in Example 4-8 to get a feel for some of the powerful functionality that NLTK provides right out of the box. Since you may not have done much work with NLTK before, don’t forget that you can use the built-in help function to get more information whenever you need it. For example, help(nltk) would provide documentation on the NLTK package in an interpreter session.
Not all of the functionality from NLTK is intended for incorporation into production software, since output is written to the console and not capturable into a data structure such as a list. In that regard, methods such as nltk.text.concordance are considered “demo functionality.” Speaking of which, many of NLTK’s modules have a demo function that you can call to get some idea of how to use the functionality they provide, and the source code for these demos is a great starting point for learning how to use new APIs. For example, you could run nltk.text.demo() in the interpreter to get some additional insight into the capabilities provided by the nltk.text module.
Example 4-8 demonstrates some good starting points for exploring the data with sample output included as part of an interactive interpreter session, and the same commands to explore the data are included in the IPython Notebook for this chapter. Please follow along with this example and examine the outputs of each step along the way. Are you able to follow along and understand the investigative flow of the interpreter session? Take a look, and we’ll discuss some of the details after the example.
Note
The next example includes stopwords, which—as noted earlier—are words that appear frequently in text but usually relay very little information (e.g., a, an, the, and other determinants).
# Explore some of NLTK's functionality by exploring the data.
# Here are some suggestions for an interactive interpreter session.

import nltk

# Download ancillary nltk packages if not already installed
nltk.download('stopwords')

all_content = " ".join([ a['object']['content'] for a in activity_results ])

# Approximate bytes of text
len(all_content)

tokens = all_content.split()
text = nltk.Text(tokens)

# Examples of the appearance of the word "open"
text.concordance("open")

# Frequent collocations in the text (usually meaningful phrases)
text.collocations()

# Frequency analysis for words of interest
fdist = text.vocab()
fdist["open"]
fdist["source"]
fdist["web"]
fdist["2.0"]

# Number of words in the text
len(tokens)

# Number of unique words in the text
len(fdist.keys())

# Common words that aren't stopwords
[w for w in fdist.keys()[:100] \
   if w.lower() not in nltk.corpus.stopwords.words('english')]

# Long words that aren't URLs
[w for w in fdist.keys() if len(w) > 15 and not w.startswith("http")]

# Number of URLs
len([w for w in fdist.keys() if w.startswith("http")])

# Enumerate the frequency distribution
for rank, word in enumerate(fdist):
    print rank, word, fdist[word]
Note
The examples throughout this chapter, including the prior example, use the split method to tokenize text. Tokenization isn’t quite as simple as splitting on whitespace, however, and Chapter 5 introduces more sophisticated approaches for tokenization that work better for the general case.
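To get a quick feel for the difference, compare naive whitespace splitting to NLTK’s own tokenizer on one of the sample sentences. This is just an illustrative sketch; nltk.word_tokenize may require the 'punkt' data to be downloaded first.

import nltk

# A minimal sketch contrasting whitespace splitting with NLTK's tokenizer.
# nltk.word_tokenize may require: nltk.download('punkt')
sentence = "Mr. Green killed Colonel Mustard in the study with the candlestick."

print sentence.split()              # trailing punctuation stays attached: 'candlestick.'
print nltk.word_tokenize(sentence)  # punctuation is generally split off as separate tokens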
The last command in the interpreter session lists the words from the frequency distribution, sorted by frequency. Not surprisingly, stopwords like the, to, and of are the most frequently occurring, but there’s a steep decline and the distribution has a very long tail. We’re working with a small sample of text data, but this same property will hold true for any frequency analysis of natural language.
Zipf’s law, a well-known empirical law of natural language, asserts that a word’s frequency within a corpus is inversely proportional to its rank in the frequency table. What this means is that if the most frequently occurring term in a corpus accounts for N% of the total words, the second most frequently occurring term in the corpus should account for (N/2)% of the words, the third most frequent term for (N/3)% of the words, and so on. When graphed, such a distribution (even for a small sample of data) shows a curve that hugs each axis, as you can see in Figure 4-4.
Though perhaps not initially obvious, most of the area in such a distribution lies in its tail and for a corpus large enough to span a reasonable sample of a language, the tail is always quite long. If you were to plot this kind of distribution on a chart where each axis was scaled by a logarithm, the curve would approach a straight line for a representative sample size.
Zipf’s law gives you insight into what a frequency distribution for words appearing in a corpus should look like, and it provides some rules of thumb that can be useful in estimating frequency. For example, if you know that there are a million (nonunique) words in a corpus, and you assume that the most frequently used word (usually the, in English) accounts for 7% of the words,[16] you could derive the total number of logical calculations an algorithm performs if you were to consider a particular slice of the terms from the frequency distribution. Sometimes this kind of simple, back-of-the-napkin arithmetic is all that it takes to sanity-check assumptions about a long-running wall-clock time, or confirm whether certain computations on a large enough data set are even tractable.
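As a back-of-the-napkin sketch of that kind of estimate, assume a corpus of one million (nonunique) words whose top-ranked word accounts for 7% of them; under Zipf’s law the k-th ranked word is then expected to occur roughly (0.07 × 1,000,000)/k times. The numbers below are purely illustrative.

# A minimal back-of-the-napkin sketch based on Zipf's law. Assumes a corpus of
# one million (nonunique) words in which the top-ranked word accounts for 7%.
TOTAL_WORDS = 1000000
TOP_WORD_SHARE = 0.07

estimated_counts = [TOP_WORD_SHARE * TOTAL_WORDS / rank for rank in range(1, 101)]

print int(estimated_counts[0])    # ~70,000 occurrences for the top-ranked word
print int(sum(estimated_counts))  # ~363,000 -- the top 100 words cover roughly a third
                                  # of the corpus under this model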
Note
Can you graph the same kind of curve shown in Figure 4-4 for the content from a few hundred Google+ activities using the techniques introduced in this chapter combined with IPython’s plotting functionality, as introduced in Chapter 1?
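One way to approach that exercise, sketched below, is to sort the frequency distribution’s counts and plot them on log-log axes; the example assumes the fdist variable from Example 4-8 is in scope and uses matplotlib’s pyplot, which is what IPython’s plotting support builds on.

import matplotlib.pyplot as plt

# A minimal sketch of a Zipf-style plot for the activity content, assuming the
# fdist frequency distribution from Example 4-8 is in scope.
counts = sorted(fdist.values(), reverse=True)
ranks = range(1, len(counts) + 1)

plt.loglog(ranks, counts)  # log-log axes make the curve approach a straight line
plt.xlabel('Rank of token')
plt.ylabel('Frequency of token')
plt.show()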
Applying TF-IDF to Human Language
Let’s apply TF-IDF to the Google+ data we collected earlier and see how it works out as a tool for querying the data. NLTK provides some abstractions that we can use instead of rolling our own, so there’s actually very little to do now that you understand the underlying theory. The listing in Example 4-9 assumes you saved the working Google+ data from earlier in this chapter as a JSON file, and it allows you to pass in multiple query terms that are used to score the documents by relevance.
import json
import nltk

# Load in human language data from wherever you've saved it
DATA = 'resources/ch04-googleplus/107033731246200681024.json'
data = json.loads(open(DATA).read())

# XXX: Provide your own query terms here
QUERY_TERMS = ['SOPA']

activities = [activity['object']['content'].lower().split() \
              for activity in data \
                if activity['object']['content'] != ""]

# TextCollection provides tf, idf, and tf_idf abstractions so
# that we don't have to maintain/compute them ourselves
tc = nltk.TextCollection(activities)

relevant_activities = []

for idx in range(len(activities)):
    score = 0
    for term in [t.lower() for t in QUERY_TERMS]:
        score += tc.tf_idf(term, activities[idx])
    if score > 0:
        relevant_activities.append({'score': score, 'title': data[idx]['title'],
                                    'url': data[idx]['url']})

# Sort by score and display results
relevant_activities = sorted(relevant_activities,
                             key=lambda p: p['score'], reverse=True)
for activity in relevant_activities:
    print activity['title']
    print '\tLink: %s' % (activity['url'], )
    print '\tScore: %s' % (activity['score'], )
    print
Sample query results for “SOPA,” a controversial piece of proposed legislation, on Tim O’Reilly’s Google+ data are as follows:
I think the key point of this piece by +Mike Loukides, that PIPA and SOPA provide a "right of ext...
        Link: https://plus.google.com/107033731246200681024/posts/ULi4RYpvQGT
        Score: 0.0805961208217
Learn to Be a Better Activist During the SOPA Blackouts +Clay Johnson has put together an awesome...
        Link: https://plus.google.com/107033731246200681024/posts/hrC5aj7gS6v
        Score: 0.0255051015259
SOPA and PIPA are bad industrial policy There are many arguments against SOPA and PIPA that are b...
        Link: https://plus.google.com/107033731246200681024/posts/LZs8TekXK2T
        Score: 0.0227351539694
Further thoughts on SOPA, and why Congress shouldn't listen to lobbyists Colleen Taylor of GigaOM...
        Link: https://plus.google.com/107033731246200681024/posts/5Xd3VjFR8gx
        Score: 0.0112879721039
...
Given a search term, being able to zero in on a handful of Google+ content items ranked by relevance is of tremendous benefit when analyzing unstructured text data. Try out some other queries and qualitatively review the results to see for yourself how well the TF-IDF metric works, keeping in mind that the absolute values of the scores aren’t really important; it’s the ability to find and sort documents by relevance that matters. Then, begin to ponder the countless ways that you could tune or augment this metric to be even more effective. One obvious improvement that’s left as an exercise for the reader is to stem verbs so that variations in elements such as tense and grammatical role resolve to the same stem and can be more accurately accounted for in similarity calculations. The nltk.stem module provides easy-to-use implementations for several common stemming algorithms.
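As a hint for that exercise, here’s the flavor of what a stemmer does; PorterStemmer is one of the implementations available in nltk.stem, and the word list below is just an illustration.

from nltk.stem import PorterStemmer

# A minimal sketch of stemming: variations of a verb resolve to the same stem,
# so they can be matched against one another in similarity calculations.
stemmer = PorterStemmer()
print [stemmer.stem(w) for w in ['run', 'runs', 'running']]  # all stem to 'run'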
Now let’s take our new tools and apply them to the foundational problem of finding similar documents. After all, once you’ve zeroed in on a document of interest, the next natural step is to discover other content that might be of interest.
Finding Similar Documents
Once you’ve queried and discovered documents of interest, one of the next things you might want to do is find similar documents. Whereas TF-IDF can provide the means to narrow down a corpus based on search terms, cosine similarity is one of the most common techniques for comparing documents to one another, which is the essence of finding a similar document. An understanding of cosine similarity requires a brief introduction to vector space models, which is the topic of the next section.
The theory behind vector space models and cosine similarity
While it has been emphasized that TF-IDF models documents as unordered collections of words, another convenient way to model documents is with a model called a vector space. The basic theory behind a vector space model is that you have a large multidimensional space that contains one vector for each document, and the distance between any two vectors indicates the similarity of the corresponding documents. One of the most beautiful things about vector space models is that you can also represent a query as a vector and discover the most relevant documents for the query by finding the document vectors with the shortest distance to the query vector.
Although it’s virtually impossible to do this subject justice in a short section, it’s important to have a basic understanding of vector space models if you have any interest at all in text mining or the IR field. If you’re not interested in the background theory and want to jump straight into implementation details on good faith, feel free to skip ahead to the next section.
Warning
This section assumes a basic understanding of trigonometry. If your trigonometry skills are a little rusty, consider this section a great opportunity to brush up on high school math. If you’re not feeling up to it, just skim this section and rest assured that there is some mathematical rigor that backs the similarity computation we’ll be employing to find similar documents.
First, it might be helpful to clarify exactly what is meant by the term vector, since there are so many subtle variations associated with it across various fields of study. Generally speaking, a vector is a list of numbers that expresses both a direction relative to an origin and a magnitude, which is the distance from that origin. A vector can naturally be represented as a line segment drawn between the origin and a point in an N-dimensional space.
To illustrate, imagine a document that is defined by only two terms (“Open” and “Web”), with a corresponding vector of (0.45, 0.67), where the values in the vector are values such as TF-IDF scores for the terms. In a vector space, this document could be represented in two dimensions by a line segment extending from the origin at (0, 0) to the point at (0.45, 0.67). In reference to an x/y plane, the x-axis would represent “Open,” the y-axis would represent “Web,” and the vector from (0, 0) to (0.45, 0.67) would represent the document in question. Nontrivial documents generally contain hundreds of terms at a minimum, but the same fundamentals apply for modeling documents in these higher-dimensional spaces; it’s just harder to visualize.
Try making the transition from visualizing a document represented by a vector with two components to a document represented by three dimensions, such as (“Open,” “Web,” and “Government”). Then consider taking a leap of faith and accepting that although it’s hard to visualize, it is still possible to have a vector represent additional dimensions that you can’t easily sketch out or see. If you’re able to do that, you should have no problem believing that the same vector operations that can be applied to a 2-dimensional space can be applied equally well to a 10-dimensional space or a 367-dimensional space. Figure 4-5 shows an example vector in 3-dimensional space.
Given that it’s possible to model documents as term-centric vectors, with each term in the document represented by its corresponding TF-IDF score, the task is to determine what metric best represents the similarity between two documents. As it turns out, the cosine of the angle between any two vectors is a valid metric for comparing them and is known as the cosine similarity of the vectors. Although it’s perhaps not yet intuitive, years of scientific research have demonstrated that computing the cosine similarity of documents represented as term vectors is a very effective metric. (It does suffer from many of the same problems as TF-IDF, though; see Closing Remarks for a brief synopsis.) Building up a rigorous proof of the details behind the cosine similarity metric would be beyond the scope of this book, but the gist is that the cosine of the angle between any two vectors indicates the similarity between them and is equivalent to the dot product of their unit vectors.
Intuitively, it might be helpful to consider that the closer two vectors are to one another, the smaller the angle between them will be, and thus the larger the cosine of the angle between them will be. Two identical vectors would have an angle of 0 degrees and a similarity metric of 1.0, while two vectors that are orthogonal to one another would have an angle of 90 degrees and a similarity metric of 0.0.
Recalling that a unit vector has a length of 1.0 (by definition), you can see that the beauty of computing document similarity with unit vectors is that they’re already normalized against what might be substantial variations in length. We’ll put all of this newly found knowledge to work in the next section.
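A tiny numeric sketch may help tie the trigonometry back to code: normalize two small vectors to unit length and take their dot product, which is exactly the cosine of the angle between them. The first vector reuses the two-term (“Open,” “Web”) example from earlier; the second is made up purely for illustration.

import math

# A minimal sketch of cosine similarity as the dot product of unit vectors.
# v2 is a hypothetical second document's weights, for illustration only.
v1 = [0.45, 0.67]
v2 = [0.40, 0.70]

def unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

cosine_similarity = sum(a * b for (a, b) in zip(unit(v1), unit(v2)))
print cosine_similarity  # close to 1.0 for these two nearly parallel vectors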
Clustering posts with cosine similarity
One of the most important points to internalize from the previous discussion is that to compute the similarity between two documents, you really just need to produce a term vector for each document and compute the dot product of the unit vectors for those documents. Conveniently, NLTK exposes the nltk.cluster.util.cosine_distance(v1,v2) function for computing cosine similarity, so it really is pretty straightforward to compare documents. As the upcoming Example 4-10 shows, all of the work involved is in producing the appropriate term vectors; in short, it computes term vectors for a given pair of documents by assigning TF-IDF scores to each component in the vectors. Because the exact vocabularies of the two documents are probably not identical, however, placeholders with a value of 0.0 must be left in each vector for words that are missing from the document at hand but present in the other one. The net effect is that you end up with two vectors of identical length with components ordered identically that can be used to perform the vector operations.
For example, suppose document1 contained the terms (A, B, C) and had the corresponding vector of TF-IDF weights (0.10, 0.15, 0.12), while document2 contained the terms (C, D, E) with the corresponding vector of TF-IDF weights (0.05, 0.10, 0.09). The derived vector for document1 would be (0.10, 0.15, 0.12, 0.0, 0.0), and the derived vector for document2 would be (0.0, 0.0, 0.05, 0.10, 0.09). Each of these vectors could be passed into NLTK’s cosine_distance function, which returns a distance equal to 1.0 minus the cosine similarity, so a smaller distance means more similar documents. Internally, cosine_distance uses the numpy module to very efficiently compute the dot product of the unit vectors, which is the basis of the result. Although the code in this section reuses the TF-IDF calculations that were introduced previously, the exact scoring function could be any useful metric. TF-IDF (or some variation thereof), however, is quite common for many implementations and provides a great starting point.
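The sketch below walks through exactly that padded-vector example with NLTK; the term labels and weights are the illustrative values from the preceding paragraph, and since cosine_distance returns a distance, the similarity is one minus that value.

import nltk

# A minimal sketch of the padded-vector example from the text. The term labels
# and TF-IDF weights are the illustrative values from the preceding paragraph.
document1 = {'A': 0.10, 'B': 0.15, 'C': 0.12}
document2 = {'C': 0.05, 'D': 0.10, 'E': 0.09}

vocabulary = sorted(set(document1) | set(document2))        # ['A', 'B', 'C', 'D', 'E']
v1 = [document1.get(term, 0.0) for term in vocabulary]      # [0.10, 0.15, 0.12, 0.0, 0.0]
v2 = [document2.get(term, 0.0) for term in vocabulary]      # [0.0, 0.0, 0.05, 0.10, 0.09]

similarity = 1 - nltk.cluster.util.cosine_distance(v1, v2)
print similarity  # ~0.19 for these particular weights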
Example 4-10 illustrates an approach for using cosine similarity to find the most similar document to each document in a corpus of Google+ data. It should apply equally well to any other type of human language data, such as blog posts or books.
import json
import nltk

# Load in human language data from wherever you've saved it
DATA = 'resources/ch04-googleplus/107033731246200681024.json'
data = json.loads(open(DATA).read())

# Only consider content that's ~1000+ words.
data = [ post for post in json.loads(open(DATA).read())
         if len(post['object']['content']) > 1000 ]

all_posts = [post['object']['content'].lower().split()
             for post in data ]

# Provides tf, idf, and tf_idf abstractions for scoring
tc = nltk.TextCollection(all_posts)

# Compute a term-document matrix such that td_matrix[doc_title][term]
# returns a tf-idf score for the term in the document
td_matrix = {}
for idx in range(len(all_posts)):
    post = all_posts[idx]
    fdist = nltk.FreqDist(post)

    doc_title = data[idx]['title']
    url = data[idx]['url']
    td_matrix[(doc_title, url)] = {}

    for term in fdist.iterkeys():
        td_matrix[(doc_title, url)][term] = tc.tf_idf(term, post)

# Build vectors such that term scores are in the same positions...
distances = {}
for (title1, url1) in td_matrix.keys():

    distances[(title1, url1)] = {}
    (min_dist, most_similar) = (1.0, ('', ''))

    for (title2, url2) in td_matrix.keys():

        # Take care not to mutate the original data structures
        # since we're in a loop and need the originals multiple times
        terms1 = td_matrix[(title1, url1)].copy()
        terms2 = td_matrix[(title2, url2)].copy()

        # Fill in "gaps" in each map so vectors of the same length can be computed
        for term1 in terms1:
            if term1 not in terms2:
                terms2[term1] = 0

        for term2 in terms2:
            if term2 not in terms1:
                terms1[term2] = 0

        # Create vectors from term maps
        v1 = [score for (term, score) in sorted(terms1.items())]
        v2 = [score for (term, score) in sorted(terms2.items())]

        # Compute similarity amongst documents
        distances[(title1, url1)][(title2, url2)] = \
            nltk.cluster.util.cosine_distance(v1, v2)

        if url1 == url2:
            #print distances[(title1, url1)][(title2, url2)]
            continue

        if distances[(title1, url1)][(title2, url2)] < min_dist:
            (min_dist, most_similar) = (distances[(title1, url1)][(title2, url2)],
                                        (title2, url2))

    print '''Most similar to %s (%s)
\t%s (%s)
\tscore %f
''' % (title1, url1,
       most_similar[0], most_similar[1], 1 - min_dist)
If you’ve found this discussion of cosine similarity interesting, the best part may at first seem almost magical: querying a vector space is the same operation as computing the similarity between documents, except that instead of comparing two document vectors, you compare your query vector against the document vectors. Take a moment to think about it: it’s a rather profound insight that the mathematics work out that way.
In terms of implementing a program to compute similarity across an entire corpus, however, take note that the naive approach means constructing a vector containing your query terms and comparing it to every single document in the corpus. Clearly, directly comparing a query vector to every possible document vector is not a good idea for even a corpus of modest size, and you’d need to make some good engineering decisions involving the appropriate use of indexes to achieve a scalable solution. We briefly touched upon the fundamental problem of needing a dimensionality reduction as a common staple in clustering in Chapter 3, and here we see the same concept emerge: any time you encounter a similarity computation, you will almost immediately encounter the need for a dimensionality reduction to make the computation tractable.
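To make that concrete, here is a minimal sketch that treats a query as a tiny document and compares it against every document vector, assuming the td_matrix from Example 4-10 is in scope; the equal 1.0 weighting of the query terms is a simplifying assumption, and the exhaustive scan is precisely the naive approach just described that an index would let you avoid:

import nltk

def query_corpus(query, td_matrix):
    query_terms = query.lower().split()
    results = []
    for ((title, url), doc_terms) in td_matrix.items():
        # Build both vectors over the union of the document's and query's vocabularies
        vocab = sorted(set(doc_terms) | set(query_terms))
        doc_vector = [doc_terms.get(term, 0.0) for term in vocab]
        query_vector = [1.0 if term in query_terms else 0.0 for term in vocab]

        # cosine_distance returns 1 minus the cosine similarity
        similarity = 1 - nltk.cluster.util.cosine_distance(query_vector, doc_vector)
        results.append((similarity, title, url))

    # Naive full scan: every document is compared against the query
    return sorted(results, reverse=True)

print(query_corpus('ada lovelace', td_matrix)[:3])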
Visualizing document similarity with a matrix diagram
One compelling way to visualize the similarity between items, as introduced in this section, is with graph-like structures, where a link between documents encodes a measure of the similarity between them. This situation presents an excellent opportunity to introduce more visualizations from D3, the state-of-the-art visualization toolkit introduced in Chapter 2. D3 is specifically designed with the interests of data scientists in mind, offers a familiar declarative syntax, and achieves a nice middle ground between high-level and low-level interfaces.
A minimal adaptation to Example 4-10 is all that’s needed to emit a collection of nodes and edges that can be used to produce visualizations similar to those in the D3 examples gallery. A nested loop can compute the similarity between the documents in our sample corpus of Google+ data from earlier in this chapter, and linkages between items may be determined based upon a simple statistical thresholding criterion.
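For concreteness, here is a rough sketch of that thresholding idea (not the turnkey code from the Notebook), assuming the distances dictionary computed in Example 4-10 and an arbitrary similarity cutoff of 0.2:

import json

SIMILARITY_THRESHOLD = 0.2  # Arbitrary cutoff; tune for your corpus

keys = sorted(distances.keys())
nodes = [{'title': title, 'url': url} for (title, url) in keys]

links = []
for (i, key1) in enumerate(keys):
    for (j, key2) in enumerate(keys):
        if i >= j:
            continue  # The matrix is symmetric, so emit each pair only once
        similarity = 1 - distances[key1][key2]
        if similarity >= SIMILARITY_THRESHOLD:
            links.append({'source': i, 'target': j, 'value': similarity})

print(json.dumps({'nodes': nodes, 'links': links}, indent=1))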
Note
The details associated with munging the data and producing the output required to power the visualizations won’t be presented here, but the turnkey example code is provided in the IPython Notebook for this chapter.
The code produces the matrix diagram in Figure 4-6. Although the text labels are not readily viewable as printed in this image, you could look at the cells in the matrix for patterns and view the labels in your browser with an interactive visualization. For example, hovering over a cell might display a tool tip with each label.
An advantage of a matrix diagram versus a graph-based layout is that there’s no potential for messy overlap between edges that represent linkages, so you avoid the proverbial “hairball” problem with your display. However, the ordering of rows and columns affects the intuition about the patterns that may be present in the matrix, so careful thought should be given to the best ordering for the rows and columns. Usually, rows and columns have additional properties that could be used to order them such that it’s easier to pinpoint patterns in the data. A recommended exercise for this chapter is to spend some time enhancing the capabilities of this matrix diagram.
Analyzing Bigrams in Human Language
As previously mentioned, one issue that is frequently overlooked in unstructured text processing is the tremendous amount of information gained when you’re able to look at more than one token at a time, because so many concepts we express are phrases and not just single words. For example, if someone were to tell you that a few of the most common terms in a post are “open,” “source,” and “government,” could you necessarily say that the text is probably about “open source,” “open government,” both, or neither? If you had a priori knowledge of the author or content, you could probably make a good guess, but if you were relying totally on a machine to try to classify a document as being about collaborative software development or transformational government, you’d need to go back to the text and somehow determine which of the other two words most frequently occurs after “open”—that is, you’d like to find the collocations that start with the token “open.”
Recall from Chapter 3 that an n-gram is just a terse way of expressing each possible consecutive sequence of n tokens from a text, and it provides the foundational data structure for computing collocations. A text containing k tokens yields k – n + 1 n-grams for any value of n, so if you were to consider all of the bigrams (2-grams) for the sequence of tokens ["Mr.", "Green", "killed", "Colonel", "Mustard"], you’d have four possibilities: [("Mr.", "Green"), ("Green", "killed"), ("killed", "Colonel"), ("Colonel", "Mustard")]. You’d need a larger sample of text than just our sample sentence to determine collocations, but assuming you had background knowledge or additional text, the next step would be to statistically analyze the bigrams in order to determine which of them are likely to be collocations.
n-grams are very simple yet very powerful as
a technique for clustering commonly co-occurring words. If you compute
all of the n-grams for even a small value of
n, you’re likely to discover that some interesting
patterns emerge from the text itself with no additional work required.
(Typically, bigrams and trigrams are what you’ll often see used in
practice for data mining exercises.) For example, in considering the
bigrams for a sufficiently long text, you’re likely to discover the
proper names, such as “Mr. Green” and “Colonel Mustard,” concepts such
as “open source” or “open government,” and so forth. In fact, computing
bigrams in this way produces essentially the same results as the collocations
function that you ran earlier,
except that some additional statistical analysis takes
into account the use of rare words. Similar patterns emerge when you
consider frequent trigrams and n-grams for values
of n slightly larger than three. As you already
know from Example 4-8, NLTK takes care of most of the
effort in computing n-grams, discovering
collocations in a text, discovering the context in which a token has
been used, and more. Example 4-11
demonstrates.
import nltk

sentence = "Mr. Green killed Colonel Mustard in the study with the " + \
           "candlestick. Mr. Green is not a very nice fellow."

nltk.ngrams(sentence.split(), 2)

txt = nltk.Text(sentence.split())
txt.collocations()
A drawback to using built-in “demo” functionality such as
nltk.Text.collocations
is that
these functions don’t usually return data structures that you
can store and manipulate. Whenever
you run into such a situation, just take a look at the underlying source
code, which is usually pretty easy to learn from and adapt for your own
purposes. Example 4-12 illustrates how you
could compute the collocations and concordance indexes for a collection
of tokens and maintain control of the results.
Note
In a Python interpreter, you can usually find the source directory for a package on disk by accessing the package’s __file__ attribute. For example, try printing out the value of nltk.__file__ to find where NLTK’s source is on disk. In IPython or IPython Notebook, you can use the “double question mark” magic to preview the source code on the spot by executing nltk??.
import json
import nltk

# Load in human language data from wherever you've saved it
DATA = 'resources/ch04-googleplus/107033731246200681024.json'
data = json.loads(open(DATA).read())

# Number of collocations to find
N = 25

all_tokens = [token for activity in data
              for token in activity['object']['content'].lower().split()]

finder = nltk.BigramCollocationFinder.from_words(all_tokens)
finder.apply_freq_filter(2)
finder.apply_word_filter(lambda w: w in nltk.corpus.stopwords.words('english'))
scorer = nltk.metrics.BigramAssocMeasures.jaccard
collocations = finder.nbest(scorer, N)

for collocation in collocations:
    c = ' '.join(collocation)
    print c
In short, the implementation loosely follows NLTK’s collocations demo function. It filters out bigrams that don’t appear more than a minimum number of times (two, in this case) and then applies a scoring metric to rank the results. In this instance, the scoring function is the well-known Jaccard Index we discussed in Chapter 3, as defined by nltk.metrics.BigramAssocMeasures.jaccard. A contingency table is used by the BigramAssocMeasures class to rank the co-occurrence of terms in any given bigram as compared to the possibilities of other words that could have appeared in the bigram. Conceptually, the Jaccard Index measures similarity of sets, and in this case, the sample sets are specific comparisons of bigrams that appeared in the text.
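If you want the scores themselves rather than just a ranked list, the finder’s score_ngrams method returns (bigram, score) pairs that you can store and manipulate; a minimal sketch, assuming the finder and scorer from Example 4-12:

# Returns ((token1, token2), score) tuples, ordered from highest to lowest score
scored_bigrams = finder.score_ngrams(scorer)

for (bigram, score) in scored_bigrams[:10]:
    print('%s\t%f' % (' '.join(bigram), score))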
The details of how contingency tables and Jaccard values are calculated are arguably an advanced topic, but the next section, Contingency tables and scoring functions, provides an extended discussion of those details since they’re foundational to a deeper understanding of collocation detection.
In the meantime, though, let’s examine some output from Tim O’Reilly’s Google+ data that makes it pretty apparent that returning scored bigrams is immensely more powerful than returning only tokens, because of the additional context that grounds the terms in meaning:
ada lovelace
jennifer pahlka
hod lipson
pine nuts
safe, welcoming
1st floor,
5 southampton
7ha cost:
bcs, 1st
borrow 42
broadcom masters
building, 5
date: friday
disaster relief
dissolvable sugar
do-it-yourself festival,
dot com
fabric samples
finance protection
london, wc2e
maximizing shareholder
patron profiles
portable disaster
rural co
vat tickets:
Keeping in mind that no special heuristics or tactics that could have inspected the text for proper names based on Title Case were employed, it’s actually quite amazing that so many proper names and common phrases were sifted out of the data. For example, Ada Lovelace is a fairly well-known historical figure that Mr. O’Reilly is known to write about from time to time (given her affiliation with computing), and Jennifer Pahlka is popular for her “Code for America” work that Mr. O’Reilly closely follows. Hod Lipson is an accomplished robotics professor at Cornell University. Although you could have read through the content and picked those names out for yourself, it’s remarkable that a machine could do it for you as a means of bootstrapping your own more focused analysis.
There’s still a certain amount of inevitable noise in the results because we have not yet made any effort to clean punctuation from the tokens, but for the small amount of work we’ve put in, the results are really quite good. This might be the right time to mention that even if reasonably good natural language processing capabilities were employed, it might still be difficult to eliminate all the noise from the results of textual analysis. Getting comfortable with the noise and finding heuristics to control it is a good idea until you get to the point where you’re willing to make a significant investment in obtaining the perfect results that a well-educated human would be able to pick out from the text.
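For example, a low-effort way to knock down some of that noise is to strip leading and trailing punctuation from the tokens before building the collocation finder; a minimal sketch, assuming the all_tokens list and nltk import from Example 4-12:

import string

# Strip leading/trailing punctuation and drop any tokens that end up empty
cleaned_tokens = [token.strip(string.punctuation) for token in all_tokens]
cleaned_tokens = [token for token in cleaned_tokens if token != '']

finder = nltk.BigramCollocationFinder.from_words(cleaned_tokens)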
Hopefully, the primary observation you’re making at this point is
that with very little effort and time invested, we’ve been able to use
another basic technique to draw out some powerful meaning from some free
text data, and the results seem to be pretty representative of
what we already suspect should be true. This is encouraging,
because it suggests that applying the same technique to anyone else’s
Google+ data (or any other kind of unstructured text, for that matter)
would potentially be just as informative, giving you a quick glimpse
into key items that are being discussed. And just as importantly, while
the data in this case probably confirms a few things you may already
know about Tim O’Reilly, you may have learned a couple of new things, as
evidenced by the people who showed up at the top of the collocations
list. While it would be easy enough to use the concordance, a regular
expression, or even the Python string type’s built-in find
method to find posts relevant to “ada lovelace,” let’s instead
take advantage of the code we developed in Example 4-9 and use TF-IDF to query for “ada
lovelace.” Here’s what comes back:
I just got an email from +Suw Charman about Ada Lovelace Day, and thought I'd share it here, sinc...
Link: https://plus.google.com/107033731246200681024/posts/1XSAkDs9b44
Score: 0.198150014715
And there you have it: the “ada lovelace” query leads us to some content about Ada Lovelace Day. You’ve effectively started with a nominal (if that) understanding of the text, zeroed in on some interesting topics using collocation analysis, and searched the text for one of those topics using TF-IDF. There’s no reason you couldn’t also use cosine similarity at this point to find the most similar post to the one about the lovely Ada Lovelace (or whatever it is that you’re keen to investigate).
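In case Example 4-9 isn’t fresh in your mind, the gist of such a query is easy to sketch. The following rough stand-in (not the chapter’s Example 4-9) assumes the data, all_posts, and tc objects from Example 4-10 and simply sums TF-IDF scores for the query terms in each post:

QUERY_TERMS = ['ada', 'lovelace']

query_scores = {}
for idx in range(len(all_posts)):
    post = all_posts[idx]
    url = data[idx]['url']

    # Score each post by summing the TF-IDF scores of the query terms in it
    query_scores[url] = sum(tc.tf_idf(term, post) for term in QUERY_TERMS)

# Show the highest-scoring posts first
for (url, score) in sorted(query_scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print('%s\t%f' % (url, score))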
Contingency tables and scoring functions
Note
This section dives into some of the more technical details of how BigramCollocationFinder and the Jaccard scoring function from Example 4-12 work. If this is your first reading of the chapter or you’re not interested in these details, feel free to skip this section and come back to it later. It’s arguably an advanced topic, and you don’t need to fully understand it to effectively employ the techniques from this chapter.
A common data structure that’s used to compute metrics related to bigrams is the contingency table. The purpose of a contingency table is to compactly express the frequencies associated with the various possibilities for the appearance of different terms of a bigram. Take a look at the bold entries in Table 4-6, where token1 expresses the existence of token1 in the bigram, and ~token1 expresses that token1 does not exist in the bigram.
         | token1                     | ~token1                     |
token2   | frequency(token1, token2)  | frequency(~token1, token2)  | frequency(*, token2)
~token2  | frequency(token1, ~token2) | frequency(~token1, ~token2) |
         | frequency(token1, *)       |                             | frequency(*, *)
Although there are a few details associated with which cells are significant for which calculations, hopefully it’s not difficult to see that the four middle cells in the table express the frequencies associated with the appearance of various tokens in the bigram. The values in these cells can be used to compute different similarity metrics that score and rank bigrams in order of likely significance, as was the case with the previously introduced Jaccard Index, which we’ll dissect in just a moment. First, however, let’s briefly discuss how the terms for the contingency table are computed.
The way that the various entries in the contingency table are computed is directly tied to which data structures you have precomputed or otherwise have available. If you assume that you have available only a frequency distribution for the various bigrams in the text, the way to calculate frequency(token1, token2) is a direct lookup, but what about frequency(~token1, token2)? With no other information available, you’d need to scan every single bigram for the appearance of token2 in the second slot and subtract frequency(token1, token2) from that value. (Take a moment to convince yourself that this is true if it isn’t obvious.)
However, if you assume that you have a frequency distribution available that counts the occurrences of each individual token in the text (the text’s unigrams) in addition to a frequency distribution of the bigrams, there’s a much less expensive shortcut you can take that involves two lookups and an arithmetic operation. Subtract the number of times the bigram (token1, token2) appeared from the number of times that token2 appeared as a unigram, and you’re left with the number of times the bigram (~token1, token2) appeared. For example, if the bigram (“mr.”, “green”) appeared three times and the unigram (“green”) appeared seven times, it must be the case that the bigram (~“mr.”, “green”) appeared four times (where ~“mr.” literally means “any token other than ‘mr.’”). In Table 4-6, the expression frequency(*, token2) represents the unigram token2 and is referred to as a marginal because it’s noted in the margin of the table as a shortcut. The value for frequency(token1, *) works the same way in helping to compute frequency(token1, ~token2), and the expression frequency(*, *) refers to any possible unigram and is equivalent to the total number of tokens in the text. Given frequency(token1, token2), frequency(token1, ~token2), and frequency(~token1, token2), the value of frequency(*, *) is necessary to calculate frequency(~token1, ~token2).
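To make the bookkeeping concrete, here is a minimal sketch that derives the four cells of a contingency table from unigram and bigram frequency distributions, using the unigram counts as the marginals just as the preceding discussion does (the contingency helper is purely illustrative, not NLTK’s internal implementation):

import nltk

tokens = "mr. green killed colonel mustard in the study with the candlestick".split()

unigram_fd = nltk.FreqDist(tokens)
bigram_fd = nltk.FreqDist(nltk.bigrams(tokens))

def contingency(token1, token2):
    n_all = unigram_fd.N()              # frequency(*, *): total number of tokens
    n_ii = bigram_fd[(token1, token2)]  # frequency(token1, token2)
    n_io = unigram_fd[token1] - n_ii    # frequency(token1, ~token2)
    n_oi = unigram_fd[token2] - n_ii    # frequency(~token1, token2)
    n_oo = n_all - n_ii - n_io - n_oi   # frequency(~token1, ~token2)
    return (n_ii, n_io, n_oi, n_oo)

print(contingency('mr.', 'green'))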
Although this discussion of contingency tables may seem somewhat tangential, it’s an important foundation for understanding different scoring functions. For example, consider the Jaccard Index as introduced back in Chapter 3. Conceptually, it expresses the similarity of two sets and is defined by:

    Jaccard(Set1, Set2) = |Set1 ∩ Set2| / |Set1 ∪ Set2|
In other words, that’s the number of items in common between the two sets divided by the total number of distinct items in the combined sets. It’s worth taking a moment to ponder this simple yet effective calculation. If Set1 and Set2 were identical, the union and the intersection of the two sets would be equivalent to one another, resulting in a ratio of 1.0. If both sets were completely different, the numerator of the ratio would be 0, resulting in a value of 0.0. Then there’s everything else in between.
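As a tiny illustration of those extremes and the middle ground, using arbitrary toy sets:

set1 = set(['open', 'source', 'government'])
set2 = set(['open', 'government', 'data'])

# Two items in common, four distinct items overall
jaccard = len(set1 & set2) / float(len(set1 | set2))
print(jaccard)  # 0.5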
The Jaccard Index as applied to a particular bigram expresses the ratio between the frequency of a particular bigram and the sum of the frequencies with which any bigram containing a term in the bigram of interest appears. One interpretation of that metric might be that the higher the ratio is, the more likely it is that (token1, token2) appears in the text, and hence the more likely it is that the collocation “token1 token2” expresses a meaningful concept.
The selection of the most appropriate scoring function is usually determined based upon knowledge about the characteristics of the underlying data, some intuition, and sometimes a bit of luck. Most of the association metrics defined in nltk.metrics.association are discussed in Chapter 5 of Christopher Manning and Hinrich Schütze’s Foundations of Statistical Natural Language Processing (MIT Press), which is conveniently available online and serves as a useful reference for the descriptions that follow.
A thorough discussion of these metrics is outside the scope of this book, but the promotional chapter just mentioned provides a detailed account with in-depth examples. The Jaccard Index, Dice’s coefficient, and the likelihood ratio are good starting points if you find yourself needing to build your own collocation detector. They are described, along with some other key terms, in the list that follows; a short sketch after the list shows how to swap these scoring functions in and out with NLTK:
- Raw frequency
As its name implies, raw frequency is the ratio expressing the frequency of a particular n-gram divided by the frequency of all n-grams. It is useful for examining the overall frequency of a particular collocation in a text.
- Jaccard Index
The Jaccard Index is a ratio that measures the similarity between sets. As applied to collocations, it is defined as the frequency of a particular collocation divided by the total number of collocations that contain at least one term in the collocation of interest. It is useful for determining the likelihood of whether the given terms actually form a collocation, as well as ranking the likelihood of probable collocations. Using notation consistent with previous explanations, this formulation would be mathematically defined as:

    frequency(token1, token2) / (frequency(token1, token2) + frequency(~token1, token2) + frequency(token1, ~token2))
- Dice’s coefficient
Dice’s coefficient is extremely similar to the Jaccard Index. The fundamental difference is that it weights agreements among the sets twice as heavily as Jaccard. It is defined mathematically as:

    Dice(Set1, Set2) = 2 × |Set1 ∩ Set2| / (|Set1| + |Set2|)

Mathematically, it can be shown fairly easily that:

    Dice = 2 × Jaccard / (1 + Jaccard), or equivalently, Jaccard = Dice / (2 − Dice)

You’d likely choose to use this metric instead of the Jaccard Index when you’d like to boost the score to favor overlap between sets, which may be handy when one or more of the differences between the sets are high. The reason is that a Jaccard score inherently diminishes as the cardinality of the set differences increases, since the union of the sets is in the denominator of the Jaccard score.
- Student’s t-score
Traditionally, Student’s t-score has been used for hypothesis testing, and as applied to n-gram analysis, t-scores can be used for testing the hypothesis of whether two terms are collocations. The statistical procedure for this calculation uses a standard distribution per the norm for t-testing. An advantage of the t-score values as opposed to raw frequencies is that a t-score takes into account the frequency of a bigram relative to its constituent components. This characteristic facilitates ranking the strengths of collocations. A criticism of the t-test is that it necessarily assumes that the underlying probability distribution for collocations is normal, which is not often the case.
- Chi-square
Like Student’s t-score, this metric is commonly used for testing independence between two variables and can be used to measure whether two tokens are collocations based upon Pearson’s chi-square test of statistical significance. Generally speaking, the differences obtained from applying the t-test and chi-square test are not substantial. The advantage of chi-square testing is that unlike t-testing, it does not assume an underlying normal distribution; for this reason, chi-square testing is more commonly used.
- Likelihood ratio
This metric is yet another approach to hypothesis testing that is used to measure the independence between terms that may form a collocation. It’s been shown to be a more appropriate approach for collocation discovery than the chi-square test in the general case, and it works well on data that includes many infrequent collocations. The particular calculations involved in computing likelihood estimates for collocations as implemented by NLTK assume a binomial distribution, where the parameters governing the distribution are calculated based upon the number of occurrences of collocations and constituent terms.
- Pointwise Mutual Information
Pointwise Mutual Information (PMI) is a measure of how much information is gained about a particular word if you also know the value of a neighboring word. To put it another way, it refers to how much one word can tell you about another. Ironically (in the context of the current discussion), the calculations involved in computing the PMI lead it to score high-frequency words lower than low-frequency words, which is the opposite of the desired effect. Therefore, it is a good measure of independence but not a good measure of dependence (i.e., it’s a less-than-ideal choice for scoring collocations). It has also been shown that sparse data is a particular stumbling block for PMI scoring, and that other techniques such as the likelihood ratio tend to outperform it.
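As promised, here is a minimal sketch of swapping these scoring functions in and out, assuming the finder object from Example 4-12 is in scope; each measure is exposed as a method on nltk.metrics.BigramAssocMeasures:

from nltk.metrics import BigramAssocMeasures

TOP_N = 10
scorers = [
    ('raw_freq', BigramAssocMeasures.raw_freq),
    ('jaccard', BigramAssocMeasures.jaccard),
    ('dice', BigramAssocMeasures.dice),
    ('student_t', BigramAssocMeasures.student_t),
    ('chi_sq', BigramAssocMeasures.chi_sq),
    ('likelihood_ratio', BigramAssocMeasures.likelihood_ratio),
    ('pmi', BigramAssocMeasures.pmi),
]

# Compare which bigrams each association measure ranks most highly
for (name, scorer) in scorers:
    print(name)
    for bigram in finder.nbest(scorer, TOP_N):
        print('  ' + ' '.join(bigram))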
Evaluating and determining the best method to apply in any particular situation is often as much art as science. Some problems are fairly well studied and provide a foundation that guides additional work, while some circumstances often require more novel research and experimentation. For most nontrivial problems, you’ll want to consider exploring the latest scientific literature (whether it be a textbook or a white paper from academia that you find with Google Scholar) to determine if a particular problem you are trying to solve has been well studied.
Reflections on Analyzing Human Language Data
This chapter has introduced a variety of tools and processes for analyzing human language data, and some closing reflections may be helpful in synthesizing its content:
- Context drives meaning
While TF-IDF is a powerful tool that’s easy to use, our specific implementation of it has a few important limitations that we’ve conveniently overlooked but that you should consider. One of the most fundamental is that it treats a document as a “bag of words,” which means that the order of terms in both the document and the query itself does not matter. For example, querying for “Green Mr.” would return the same results as “Mr. Green” if we didn’t implement logic to take the query term order into account or interpret the query as a phrase as opposed to a pair of independent terms. But obviously, the order in which terms appear is very important.
In performing an n-gram analysis to account for collocations and term ordering, we still face the underlying issue that TF-IDF assumes that all tokens with the same text value mean the same thing. Clearly, however, this need not be the case. A homonym is a word that has identical spellings and pronunciations to another word but whose meaning is driven entirely by context, and any homonym of your choice is a counterexample. Homonyms such as book, match, cave, and cool are a few examples that should illustrate the importance of context in determining the meaning of a word.
Cosine similarity suffers from many of the same flaws as TF-IDF. It does not take into account the context of the document or the term order from the n-gram analysis, and it assumes that terms appearing close to one another in vector space are necessarily similar, which is certainly not always the case. As with TF-IDF, the obvious counterexample is homonyms. Our particular implementation of cosine similarity also hinges on TF-IDF scoring as its means of computing the relative importance of words in documents, so the TF-IDF errors have a cascading effect.
- Human language is overloaded with context
You’ve probably noticed that there can be a lot of pesky details that have to be managed in analyzing unstructured text, and these details turn out to be pretty important for competitive implementations. For example, string comparisons are case-sensitive, so it’s important to normalize terms so that frequencies can be calculated as accurately as possible. However, blindly normalizing to lowercase can also complicate the situation since the case used in certain words and phrases can be important.
“Mr. Green” and “Web 2.0” are two examples worth considering. In the case of “Mr. Green,” maintaining the title case in “Green” could potentially be advantageous since it could provide a useful clue to a query algorithm that this term is not referring to an adjective and is likely part of a noun phrase. We’ll briefly touch on this topic again in Chapter 5 when NLP is discussed, since it’s ultimately the context in which “Green” is being used that is lost with the bag-of-words approach, whereas more advanced parsing with NLP has the potential to preserve that context.
- Parsing context from human language isn’t easy
Another consideration that’s rooted more in our particular implementation than in a general characteristic of TF-IDF itself is that our use of split to tokenize the text may leave trailing punctuation on tokens, which can affect tabulating frequencies. For example, in Example 4-6, corpus['b'] ends with the token “study.”; this is not the same as the token “study” that appears in corpus['a'] (the token that someone would probably be more likely to query). In this instance, the trailing period on the token affects both the TF and the IDF calculations. Something as seemingly simple as a period signaling the end of a sentence is context that our brain processes trivially, but it’s much more difficult for a machine to do this with the same level of accuracy.
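As a quick illustration of the issue, using the sample sentence from Example 4-11 rather than the Example 4-6 corpus:

sentence = "Mr. Green killed Colonel Mustard in the study with the " + \
           "candlestick. Mr. Green is not a very nice fellow."

tokens = sentence.lower().split()
print('candlestick.' in tokens)  # True: split leaves the trailing period attached
print('candlestick' in tokens)   # False: a query for "candlestick" misses this token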
Writing software to help machines better understand the context of words as they appear in human language data is a very active area of research and has tremendous potential for the future of search technology and the Web.
Closing Remarks
This chapter introduced the Google+ API and how to collect and cleanse human language data as part of an exercise in querying for a person’s Google+ activities. We then spent some time learning about a few of the fundamentals of IR theory, TF-IDF, cosine similarity, and collocations as the means of analyzing the data we collect. Eventually we worked up to the point where we were considering some of the same problems that any search engine provider has had to consider to build a successful technology product. However, even though I hope this chapter has given you some good insight into how to extract useful information from unstructured text, it’s barely scratched the surface of the most fundamental concepts, both in terms of theory and engineering considerations. Information retrieval is literally a multibillion-dollar industry, so you can only imagine the amount of combined investment that goes into both the theory and implementations that work at scale to power search engines such as Google and Bing.
Given the immense power of search providers like Google, it’s easy to forget that these foundational search techniques even exist. However, understanding them yields insight into the assumptions and limitations of the commonly accepted status quo for search, while also clearly differentiating the state-of-the-art, entity-centric techniques that are emerging. Chapter 5 introduces a fundamental paradigm shift away from some of the techniques in this chapter. There are lots of exciting opportunities for technology-driven companies that can effectively analyze human language data.
Note
The source code outlined for this chapter and all other chapters is available at GitHub in a convenient IPython Notebook format that you’re highly encouraged to try out from the comfort of your own web browser.
Recommended Exercises
Take advantage of IPython Notebook’s plotting features, introduced in Chapter 1, to graph Zipf’s curve for the tokens from a corpus.
Mine the comments feed and try to identify trends based on frequency of commenting. For example, who are the most frequent commenters across a few hundred activities for a popular Google+ user like Tim O’Reilly?
Mine Google+ activities to discover which activities are the most popular. A suitable starting point might be the number of comments and number of times a post is reshared.
Fetch the content from links that are referenced in Google+ activities and adapt the cleanHtml function from this chapter to extract the text across the web pages in a user’s activity stream for analysis. Are there any common themes in links that are shared? What are the most frequent words that appear in the text?
If you’d like to try applying the techniques from this chapter to the Web (in general), you might want to check out Scrapy, an easy-to-use and mature web scraping and crawling framework that can help you to harvest web pages.
Spend some time and add interactive capabilities to the matrix diagram presented in this chapter. Can you add event handlers to automatically take you to the post when text is clicked on? Can you conceive of any meaningful ways to order the rows and columns so that it’s easier to identify patterns?
Update the code that emits the JSON that drives the matrix diagram so that it computes similarity differently and thus correlates documents differently from the default implementation.
What additional features from the text can you think of that would make computing similarity across documents more accurate?
Spend some time really digging into the theory presented in this chapter for the underlying IR concepts that were presented.
[13] This book avoids splitting hairs over exactly what differences could be implied by common phrases such as text mining, unstructured data analytics (UDA), or information retrieval, and simply treats them as essentially the same thing.
[14] Google+ only offered personalized URLs such as https://plus.google.com/+TimOReilly to well-known individuals at the outset, but is in the process of making them available to all users. Until you are either selected by Google or eligible to apply, you’ll have to continue using your more arcane URL with a numeric identifier in it (such as https://plus.google.com/107033731246200681024).
[15] Stopwords are words that appear frequently in text but usually relay little information. Common examples of stopwords are a, an, the, and other determinants.
[16] The word the accounts for 7% of the tokens in the Brown Corpus and provides a reasonable starting point for a corpus if you don’t know anything else about it.