Chapter 2. Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
In this chapter, we’ll tap into the Facebook platform through its (Social) Graph API and explore some of the vast possibilities. Facebook is arguably the heart of the social web and is somewhat of an all-in-one wonder, given that more than half of its 1 billion users[2] are active each day updating statuses, posting photos, exchanging messages, chatting in real time, checking in to physical locales, playing games, shopping, and just about anything else you can imagine. From a social web mining standpoint, the wealth of data that Facebook stores about individuals, groups, and products is quite exciting, because Facebook’s clean API presents incredible opportunities to synthesize it into information (the world’s most precious commodity) and to glean valuable insights. On the other hand, this great power commands great responsibility, and Facebook has implemented the most sophisticated set of online privacy controls that the world has ever seen in order to help protect its users from exploitation.
It’s worth noting that although Facebook is self-proclaimed as a social graph, it’s been steadily transforming into a valuable interest graph as well, because it maintains relationships between people and the things that they’re interested in through its Facebook pages and “Likes” feature. In this regard, you may increasingly hear it framed as a “social interest graph.” For the most part, you can make a case that interest graphs implicitly exist and can be bootstrapped from most sources of social data. As an example, Chapter 1 made the case that Twitter is actually an interest graph because of its asymmetric “following” (or, to say it another way, “interested in”) relationships between people and other people, places, or things. The notion of Facebook as an interest graph will come up throughout this chapter, and we’ll return to the idea of explicitly bootstrapping an interest graph from social data in Chapter 7.
The remainder of this chapter assumes that you have an active Facebook account, which is required to gain access to the Facebook APIs. Although there are plenty of fun things that you can do to analyze public information, you may find that it’s a little bit more fun if you are analyzing data from your own social network, so it’s worth adding a few friends if you are new to Facebook.
Note
Always get the latest bug-fixed source code for this chapter (and every other chapter) online at http://bit.ly/MiningTheSocialWeb2E. Be sure to also take advantage of this book’s virtual machine experience, as described in Appendix A, to maximize your enjoyment of the sample code.
Overview
As this is the second chapter in the book, the concepts we’ll cover are a bit more complex than those in Chapter 1, but should still be highly accessible for a very broad audience. In this chapter you’ll learn about:
Facebook’s Social Graph API and how to make API requests
The Open Graph protocol and its relationship to Facebook’s Social Graph
Analyzing likes from Facebook pages and from Facebook friends
Techniques such as clique analysis for analyzing social graphs
Visualizing social graphs with the D3 JavaScript library
Exploring Facebook’s Social Graph API
The Facebook platform is a mature, robust, and well-documented gateway into what may be the most comprehensive and well-organized information store ever amassed, both in terms of breadth and depth. It’s broad in that its user base represents about one-seventh of the entire living population, and it’s deep with respect to the amount of information that’s known about any one of its particular users. Whereas Twitter features an asymmetric friendship model that is open and predicated on following other users without any particular consent, Facebook’s friendship model is symmetric and requires a mutual agreement between users to gain visibility into one another’s interactions and activities.
Furthermore, whereas virtually all interactions except for private messages between users on Twitter are public statuses, Facebook allows for much more finely grained privacy controls in which friendships can be organized and maintained as lists with varying levels of visibility available to a friend on any particular activity. For example, you might choose to share a link or photo only with a particular list of friends as opposed to your entire social network.
As a social web miner, the only way that you can access a Facebook user’s account data is by registering an application and using that application as the entry point into the Facebook developer platform. Moreover, the only data that’s available to an application is whatever the user has explicitly authorized it to access. For example, as a developer writing a Facebook application, you’ll be the user who’s logging into the application, and the application will be able to access any data that you explicitly authorize it to access. In that regard, as a Facebook user you might think of an application a bit like any of your Facebook friends, in that you’re ultimately in control of what the application can access and you can revoke access at any time. The Facebook Platform Policies document is a must-read for any Facebook developer, as it provides the comprehensive set of rights and responsibilities for all Facebook users as well as the spirit and letter of the law for Facebook developers. If you haven’t already, it’s worth taking a moment to review Facebook’s developer policies and to bookmark the Facebook Developers home page, since it is the definitive entry point into the Facebook platform and its documentation.
Note
Keep in mind that as a developer mining your own account, you may not have a problem allowing your own application to access all of your account data. Beware, however, of aspiring to develop a successful hosted application that requests access to more than the minimum amount of data it needs, because it’s quite likely that a user will not recognize or trust your application to command that level of privilege (and rightly so).
Although we’ll programmatically access the Facebook platform later in this chapter, Facebook provides a number of useful developer tools, including a Graph API Explorer app that we’ll be using for initial familiarization with the Social Graph. The app provides an intuitive and turnkey way of querying the Social Graph, and once you’re comfortable with how the Social Graph works, translating queries into Python code for automation and further processing comes quite naturally. Although we’ll work through the Graph API as part of the discussion, you may benefit from an initial review of the well-written “Getting Started: The Graph API” document as a comprehensive preamble.
Warning
In addition to the Graph API, you may also encounter the Facebook Query Language (FQL) and what is now referred to as the Legacy REST API. Be advised that although FQL is still very much alive and will be briefly introduced in this chapter, the Legacy REST API is in deprecation and will be phased out soon. Do not use it for any development that you do with Facebook.
Understanding the Social Graph API
As its name implies, Facebook’s Social Graph is a massive graph data structure representing social interactions and consisting of nodes and connections between the nodes. The Graph API provides the primary means of interacting with the Social Graph, and the best way to get acquainted with the Graph API is to spend a few minutes tinkering around with the Graph API Explorer.
It is important to note that the Graph API Explorer is not a particularly special tool of any kind. Aside from being able to prepopulate and debug your access token, it is an ordinary Facebook app that uses the same developer APIs that any other developer application would use. In fact, the Graph API Explorer is handy when you have a particular OAuth token that’s associated with a specific set of authorizations for an application that you are developing and you want to run some queries as part of an exploratory development effort or debug cycle. We’ll revisit this general idea shortly as we programmatically access the Graph API. Figures 2-1 through 2-4 illustrate a progressive series of Graph API queries that result from clicking on the plus (+) symbol and adding connections and fields. There are a few items to note about this particular query:
- Access token
The access token that appears in the application is an OAuth token that is provided as a courtesy for the logged-in user; it is the same OAuth token that your application would need to access the data in question. We’ll opt to use this access token throughout this chapter, but you can consult Appendix B for a brief overview of OAuth, including details on implementing an OAuth flow for Facebook in order to retrieve an access token. As mentioned in Chapter 1, if this is your first encounter with OAuth, it’s probably sufficient at this point to know that the protocol is a social web standard that stands for Open Authorization. In short, OAuth is a means of allowing users to authorize third-party applications to access their account data without needing to share sensitive information like a password.
Note
See Appendix B for details on implementing an OAuth 2.0 flow that you would need to build an application that requires an arbitrary user to authorize it to access account data.
- Node IDs
The basis of the query is a node with an ID (identifier) of “644382747,” corresponding to a person named “Matthew A. Russell,” who is preloaded as the currently logged-in user for the Graph Explorer. The “id” and “name” values for the node are called fields. The basis of the query could just as easily have been any other node, and as we’ll soon see, it’s very natural to “walk” or traverse the graph and query other nodes (which may be people or things such as books or TV shows).
- Connection constraints
You can modify the original query with a “friends” connection, as shown in Figure 2-2, by clicking on the + and then scrolling to “friends” in the “connections” pop-up menu. The “friends” connections that appear in the console represent nodes that are connected to the original query node. At this point, you could click on any of the blue ID fields in these nodes and initiate a query with that particular node as the basis. In network science terminology, we now have what is called an ego graph, because it has an actor (or ego) as its focal point or logical center, which is connected to other nodes around it. An ego graph would resemble a hub and spokes if you were to draw it.
- Likes constraints
A further modification to the original query is to add “likes” connections for each of your friends, as shown in Figure 2-3. Before you can retrieve likes connections for your friends, however, you must authorize the Graph API Explorer application to explicitly access your friends’ likes by updating the access token that it uses and then approve this access, as shown in Figure 2-4. The Graph API Explorer allows you to easily authorize it by clicking on the Get Access Token button and checking the “friends_likes” box on the Friends Data Permissions tab. In network science terminology, we still have an ego graph, but it’s potentially much more complex at this point because of the many additional nodes and connections that could exist among them.
- Debugging
The Debug button can be useful for troubleshooting queries that you think should be returning data but aren’t doing so based on the authorizations associated with the access token.
- JSON response format
The results of a Graph API query are returned in a convenient JSON format that can be easily manipulated and processed.
Although we’ll programmatically explore the Graph API with a Python package later in this chapter, you could opt to make Graph API queries more directly over HTTP yourself by mimicking the request that you see in the Graph API Explorer. For example, Example 2-1 uses the requests package to simplify the process of making an HTTP request (as opposed to using a much more cumbersome package from Python’s standard library, such as urllib2) for fetching your friends and their likes. You can install this package in a terminal with the predictable pip install requests command. The query is driven by the values in the fields parameter and is the same as what would be built up interactively in the Graph API Explorer. Of particular interest is that the friends.limit(10).fields(likes.limit(10)) syntax uses a relatively new feature of the Graph API called field expansion that is designed to make and parameterize multiple queries in a single API call.
import requests # pip install requests
import json

base_url = 'https://graph.facebook.com/me'

# Get 10 likes for 10 friends
fields = 'id,name,friends.limit(10).fields(likes.limit(10))'

url = '%s?fields=%s&access_token=%s' % (base_url, fields, ACCESS_TOKEN,)

# This API is HTTP-based and could be requested in the browser,
# with a command line utility like curl, or using just about
# any programming language by making a request to the URL.
# Click the hyperlink that appears in your notebook output
# when you execute this code cell to see for yourself...
url

# Interpret the response as JSON and convert back
# to Python data structures
content = requests.get(url).json()

# Pretty-print the JSON and display it
print(json.dumps(content, indent=1))
If you attempt to run a query for all of your friends’ likes by setting fields = 'id,name,friends.fields(likes)', and the script appears to hang, it is probably because you have a lot of friends who have a lot of likes. If this happens, you may need to add limits and offsets to the fields in the query, as described in Facebook’s field expansion documentation. However, the facebook package that you’ll learn about later in this chapter handles some of these issues, so it’s recommended that you hold off and try it out first. This initial example is just to illustrate that Facebook’s API is built on top of HTTP. A couple of field limit/offset examples that illustrate the possibilities with field selectors follow:
# Get all likes for 10 friends
fields = 'id,name,friends.limit(10).fields(likes)'

# Get all likes for 10 more friends
fields = 'id,name,friends.offset(10).limit(10).fields(likes)'

# Get 10 likes for all friends
fields = 'id,name,friends.fields(likes.limit(10))'
It appears that the default limit for queries at the time of this writing is 5,000 items per result set. It’s possible but somewhat unlikely that you’ll be making Graph API queries that could return more than 5,000 items; if you do, consult the pagination documentation for information on how to navigate through the “pages” of results.
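Facebook’s pagination scheme lends itself to a small helper that simply follows each paging.next link until none remains. The following sketch is illustrative only (paginate, default_fetch, and the injectable fetch parameter are not part of any Facebook SDK), and it is demonstrated with stubbed responses so that it runs without an access token:

```python
import json
from urllib.request import urlopen

def default_fetch(url):
    # Fetch a URL and parse the JSON body (stdlib only; the
    # requests package would work equally well here).
    return json.load(urlopen(url))

def paginate(url, fetch=default_fetch):
    # Yield the 'data' list from each page of a Graph API result,
    # following the 'paging.next' links until they are exhausted.
    while url:
        page = fetch(url)
        yield page.get('data', [])
        url = page.get('paging', {}).get('next')

# Stubbed two-page result set standing in for real API responses
fake_pages = {
    'page1': {'data': [{'id': '1'}, {'id': '2'}],
              'paging': {'next': 'page2'}},
    'page2': {'data': [{'id': '3'}]},
}

all_items = [item
             for page in paginate('page1', fetch=lambda u: fake_pages[u])
             for item in page]
print([item['id'] for item in all_items])
```

Injecting the fetch function keeps the traversal logic testable offline; in real use you would pass a function that performs an authenticated HTTP GET.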
Understanding the Open Graph Protocol
In addition to sporting a powerful Graph API that allows you to traverse the Social Graph and query familiar Facebook objects, you should also know that Facebook unveiled something called the Open Graph protocol (OGP) back in April 2010, at the same F8 conference at which it introduced the Social Graph. In short, OGP is a mechanism that enables developers to make any web page an object in Facebook’s Social Graph by injecting some RDFa metadata into the page. Thus, in addition to being able to access from within Facebook’s “walled garden” the dozens of objects that are described in the Graph API Reference (users, pictures, videos, checkins, links, status messages, etc.), you might also encounter pages from the Web that represent meaningful concepts that have been grafted into the Social Graph. In other words, OGP is a means of “opening up” the Social Graph, and you’ll see these concepts described in Facebook’s developer documentation as its “Open Graph.”[3]
There are practically limitless options for leveraging OGP to graft web pages into the Social Graph in valuable ways, and the chances are good that you’ve already encountered many of them and not even realized it. For example, consider the page for the movie The Rock on IMDb.com. In the sidebar to the right, you see a rather familiar-looking Like button with the message “19,319 people like this. Be the first of your friends.” IMDb enables this functionality by implementing OGP for each of its URLs that correspond to objects that would be relevant for inclusion in the Social Graph. With the right RDFa metadata in the page, Facebook is then able to unambiguously enable connections to these objects and incorporate them into activity streams and other key elements of the Facebook user experience.
Implementation of OGP manifesting as Like buttons on web pages may seem a bit obvious if you’ve gotten used to seeing them over the past few years, but the fact that Facebook has been fairly successful at opening up its development platform in a way that allows for arbitrary inclusion of objects on the Web is rather profound and has some potentially significant consequences.
For example, at the time of this writing in early 2013, Facebook has just started the process of launching its new Graph Search product to a limited audience. Whereas companies like Google crawl and index the entire Web in order to enable search, the basic idea behind Facebook’s Graph Search is that you type something into a search box, just like in the typical Google user experience, but you get back results that are personalized to you based upon the vast amount of your information that Facebook has. The rub, now that OGP is fairly well established, is that Facebook’s Graph Search results won’t be limited to things within the Facebook user experience, since connections from the Web are inherently incorporated into the Social Graph. It’s out of scope to ponder the wider ramifications of how disruptive Graph Search could be to the Web given Facebook’s user base, but it’s a thought exercise well worth your time.
Let’s briefly take a look at the gist of implementing OGP before moving on to Graph API queries. The canonical example from the OGP documentation that demonstrates how to turn IMDb’s page on The Rock into an object in the Open Graph protocol as part of an XHTML document that uses namespaces looks something like this:
<html xmlns:og="http://ogp.me/ns#">
  <head>
    <title>The Rock (1996)</title>
    <meta property="og:title" content="The Rock" />
    <meta property="og:type" content="movie" />
    <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
    <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
    ...
  </head>
  ...
</html>
These bits of metadata have great potential once realized at a massive scale, because they enable a URI like http://www.imdb.com/title/tt0117500 to unambiguously represent any web page—whether it’s for a person, company, product, etc.—in a machine-readable way and further the vision for a semantic web. In addition to being able to “like” The Rock, users could potentially interact with this object in other ways through custom actions. For example, users might be able to indicate that they have watched The Rock, since it is a movie. OGP allows for a wide and flexible set of actions between users and objects as part of the Social Graph.
Note
If you haven’t already, go ahead and view the source HTML for http://www.imdb.com/title/tt0117500 and see for yourself what the RDFa looks like out in the wild.
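To see how a consumer of OGP might recover this metadata, here is a minimal sketch that scrapes og: properties out of a page with Python’s standard html.parser module. OGPParser is a hypothetical name (not a standard library class), and the sample markup mirrors the snippet above rather than a live fetch of the IMDb page:

```python
from html.parser import HTMLParser

class OGPParser(HTMLParser):
    # Collect <meta property="og:..."> tags into a dictionary.
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attrs = dict(attrs)
            prop = attrs.get('property') or ''
            if prop.startswith('og:'):
                self.og[prop] = attrs.get('content')

# Markup modeled on the OGP example for The Rock
sample = '''<html xmlns:og="http://ogp.me/ns#">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head></html>'''

parser = OGPParser()
parser.feed(sample)
print(parser.og)
```

A real crawler would first download the page body and would likely prefer a more forgiving parser such as BeautifulSoup, but the principle is the same: the OGP properties are just attributes on meta tags.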
At its core, querying the Graph API for Open Graph objects is incredibly simple: append a web page URL or an object’s ID to http(s)://graph.facebook.com/ to fetch details about the object. For example, fetching the URL http://graph.facebook.com/http://www.imdb.com/title/tt0117500 in your web browser would return this response:
{
 "id": "114324145263104",
 "name": "The Rock (1996)",
 "picture": "http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs344.snc4/...jpg",
 "link": "http://www.imdb.com/title/tt0117500/",
 "category": "Movie",
 "description": "Directed by Michael Bay. With Sean Connery, ...",
 "likes": 3
}
If you inspect the source for the URL http://www.imdb.com/title/tt0117500, you’ll find that fields in the response correspond to the data in the meta tags of the page, and this is no coincidence. The delivery of rich metadata in response to a simple query is the whole idea behind the way OGP is designed to work. Where it gets more interesting is when you explicitly request additional metadata for an object in the page by appending the query string parameter metadata=1 to the request. Here is a sample response for the query https://graph.facebook.com/114324145263104?metadata=1, in which we use the object’s ID instead of the IMDb web page URL:
{
 "id": "114324145263104",
 "name": "The Rock (1996)",
 "picture": "http://profile.ak.fbcdn.net/hprofile-ak-snc4/..._s.jpg",
 "link": "http://www.imdb.com/title/tt0117500",
 "category": "Movie",
 "website": "http://www.imdb.com/title/tt0117500",
 "description": "Directed by Michael Bay. With Sean Connery, ...",
 "about": "Directed by Michael Bay. With Sean Connery, Nicolas Cage, ...",
 "likes": 8606,
 "were_here_count": 0,
 "talking_about_count": 0,
 "is_published": true,
 "app_id": 115109575169727,
 "metadata": {
  "connections": {
   "feed": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "posts": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "tagged": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "statuses": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "links": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "notes": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "photos": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "albums": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "events": "http://graph.facebook.com/http://www.imdb.com/title/...",
   "videos": "http://graph.facebook.com/http://www.imdb.com/title/..."
  },
  "fields": [
   {
    "name": "id",
    "description": "The Page's ID. Publicly available. A JSON string."
   },
   {
    "name": "name",
    "description": "The Page's name. Publicly available. A JSON string."
   },
   {
    "name": "category",
    "description": "The Page's category. Publicly available. ..."
   },
   {
    "name": "likes",
    "description": "\\* The number of users who like the Page..."
   },
   ...
  ]
 },
 "type": "page"
}
The items in metadata.connections are pointers to other nodes in the graph that you can crawl to get to other intriguing bits of data. For example, you could follow the “photos” link to pull down photos associated with the movie, and potentially walk links associated with the photos to discover who posted them or see comments that might have been made about them. In case it hasn’t already occurred to you, you are also an object in the graph. Try visiting the same URL prefix, but substitute in your own Facebook ID or username as the URL context and see for yourself (e.g., visit https://graph.facebook.com/<YOUR_FB_ID> in your web browser).
Note
Try using the Facebook ID “MiningTheSocialWeb” to retrieve details about the official Facebook fan page for this book with the Graph API Explorer. You could also modify Example 2-1 to programmatically query https://graph.facebook.com/MiningTheSocialWeb to retrieve basic page information, including content posted to the page. For example, appending a query string with a qualifier such as "?fields=posts" to that URL would return a listing of its posted content.
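As a rough sketch of what such a query URL looks like, the following hypothetical helper (page_fields_url is an illustrative name, not part of any SDK, and a real request would also need an access_token parameter appended) assembles it:

```python
BASE_URL = 'https://graph.facebook.com'

def page_fields_url(page_id, fields):
    # Build a Graph API URL that asks for specific fields on a page.
    # A real request would also carry &access_token=... on the end.
    return '%s/%s?fields=%s' % (BASE_URL, page_id, fields)

url = page_fields_url('MiningTheSocialWeb', 'posts')
print(url)
```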
As a final note of advice before moving on to programmatically accessing the Graph API, when considering the possibilities with OGP, be forward-thinking and creative, but bear in mind that it’s still evolving. As it relates to the semantic web and web standards in general, the use of “open” has understandably generated some consternation. Various kinks in the spec have been worked out along the way, and some are still probably being worked out. You could also make the case that OGP is essentially a single-vendor effort, and that it’s little more than on par with the capabilities of meta elements from the much earlier days of the Web, although the social effects appear to be driving a very different outcome.
Whether OGP and Graph Search will one day dominate the Web is a highly contentious topic, but the potential is certainly there; the indicators for its success are trending in a positive direction, and many exciting things may happen as the future unfolds and innovation continues. Now that you have an appreciation for the fuller context of the Social Graph, let’s turn back and home in on how to put the Graph API to work.
Analyzing Social Graph Connections
The official Python SDK for the Graph API is a community fork of the repository previously maintained by Facebook, and it can be installed per the standard protocol with pip via pip install facebook-sdk. This package contains a few useful convenience methods that allow you to interact with Facebook in a number of ways, including the ability to make FQL queries and post statuses or photos. However, there are really just a few key methods from the GraphAPI class (defined in the facebook.py source file) that you need to know about in order to use the Graph API to fetch data as shown next, so you could just as easily opt to query over HTTP directly with requests (as was illustrated in Example 2-1) if you prefer. The methods are:

get_object(self, id, **args)

get_objects(self, id, **args)
    Example usage: get_objects(["me", "some_other_id"], metadata=1)

get_connections(self, id, connection_name, **args)

request(self, path, args=None, post_args=None)
    Example usage: request("search", {"q" : "social web", "type" : "page"})
Note
Unlike with other social networks, there don’t appear to be clearly published guidelines about Facebook API rate limits. Although the availability of the APIs seems to be quite generous, you should still carefully design your application to use the APIs as little as possible and handle any and all error conditions as a recommended best practice.
The most common (and often, the only) keyword argument you’ll probably use is metadata=1, in order to get back the connections associated with an object in addition to just the object details themselves. Take a look at Example 2-2, which introduces the GraphAPI class and uses several of its methods to query for information about you, information about your friends, and the term social web. This example also introduces a helper function called pp that is used throughout the remainder of this chapter for pretty-printing results as nicely formatted JSON to save some typing.
import facebook # pip install facebook-sdk
import json

# A helper function to pretty-print Python objects as JSON
def pp(o):
    print(json.dumps(o, indent=1))

# Create a connection to the Graph API with your access token
g = facebook.GraphAPI(ACCESS_TOKEN)

# Execute a few sample queries
print('---------------')
print('Me')
print('---------------')
pp(g.get_object('me'))

print('---------------')
print('My Friends')
print('---------------')
pp(g.get_connections('me', 'friends'))

print('---------------')
print('Social Web')
print('---------------')
pp(g.request("search", {'q': 'social web', 'type': 'page'}))
Sample results for the queries from Example 2-2 are shown below and are predictable enough. If you were using the Graph API Explorer, the results would be identical. During development, it can often be very handy to use the Graph API Explorer and an IPython shell or IPython Notebook in tandem, depending on your specific objective. The advantage of the Graph API Explorer is the ease with which you can click on ID values and spawn new queries during exploratory efforts. Sample results from Example 2-2 follow:
---------------
Me
---------------
{
 "last_name": "Russell",
 "relationship_status": "Married",
 "locale": "en_US",
 "hometown": {
  "id": "104012476300889",
  "name": "Princeton, West Virginia"
 },
 "quotes": "The only easy day was yesterday.",
 "favorite_athletes": [
  {
   "id": "112063562167357",
   "name": "Rich Froning Jr. Fan Site"
  }
 ],
 "timezone": -5,
 "education": [
  {
   "school": {
    "id": "112409175441352",
    "name": "United States Air Force Academy"
   },
   "type": "College",
   "year": {
    "id": "194603703904595",
    "name": "2003"
   }
  }
 ],
 "id": "644382747",
 "first_name": "Matthew",
 "middle_name": "A.",
 "languages": [
  {
   "id": "106059522759137",
   "name": "English"
  },
  {
   "id": "312525296370",
   "name": "Spanish"
  }
 ],
 "location": {
  "id": "103078413065161",
  "name": "Franklin, Tennessee"
 },
 "email": "ptwobrussell@gmail.com",
 "username": "ptwobrussell",
 "bio": "How I Really Feel About Using Facebook (Or: A Disclaimer)...",
 "birthday": "06/17/1981",
 "link": "http://www.facebook.com/ptwobrussell",
 "verified": true,
 "name": "Matthew A. Russell",
 "gender": "male",
 "work": [
  {
   "position": {
    "id": "135722016448189",
    "name": "Chief Technology Officer (CTO)"
   },
   "start_date": "0000-00",
   "employer": {
    "id": "372007624109",
    "name": "Digital Reasoning"
   }
  }
 ],
 "updated_time": "2013-04-04T14:09:22+0000",
 "significant_other": {
  "name": "Bas Russell",
  "id": "6224364"
 }
}
---------------
My Friends
---------------
{
 "paging": {
  "next": "https://graph.facebook.com/644382747/friends?..."
 },
 "data": [
  {
   "name": "Bas Russell",
   "id": "6224364"
  },
  ...
  {
   "name": "Jamie Lesnett",
   "id": "100002388496252"
  }
 ]
}
---------------
Social Web
---------------
{
 "paging": {
  "next": "https://graph.facebook.com/search?q=social+web&type=page..."
 },
 "data": [
  {
   "category": "Book",
   "name": "Mining the Social Web",
   "id": "146803958708175"
  },
  {
   "category": "Internet/software",
   "name": "Social & Web Marketing",
   "id": "172427156148334"
  },
  {
   "category": "Internet/software",
   "name": "Social Web Alliance",
   "id": "160477007390933"
  },
  ...
  {
   "category": "Local business",
   "name": "Social Web",
   "category_list": [
    {
     "id": "2500",
     "name": "Local Business"
    }
   ],
   "id": "145218172174013"
  }
 ]
}
At this point, you have the power of both the Graph API Explorer and the Python console—and all that they have to offer—at your fingertips. Now that we’ve scaled the walled garden, let’s turn our attention to analyzing some of its data.
Analyzing Facebook Pages
Although Facebook started out as more of a pure social networking site without a Social Graph or a good way for businesses and other entities to have a presence, it quickly adapted to take advantage of the market needs. Fast-forward a few years, and now businesses, clubs, books, and many other kinds of nonperson entities have Facebook pages with a fan base. Facebook pages are a powerful tool for businesses to engage their customers, and Facebook has gone to some lengths to provide tools that allow Facebook page administrators to understand their fans with a small toolbox that is appropriately called “Insights.”
If you’re already a Facebook user, the chances are pretty good that you’ve already liked one or more Facebook pages that represent something that you approve of or think is interesting, and in this regard, Facebook pages significantly broaden the possibilities for the Social Graph as a platform. The explicit accommodation of nonperson user entities through Facebook pages, the Like button, and the Social Graph fabric collectively provide a powerful arsenal for an interest graph platform, which carries with it a profundity of possibilities. (Refer back to Why Is Twitter All the Rage? for a discussion of why interest graphs are so abundant with useful possibilities.)
Analyzing this book’s Facebook page
Given that this book has a corresponding Facebook page that happened to turn up as the top result in a search for “social web,” it seems natural enough that we could use it as an illustrative starting point for some instructive analysis here in this chapter.[4] Here are just a few questions that might be worth considering with regard to this book’s Facebook page, or just about any other Facebook page:
How popular is the page?
How engaged are the page’s fans?
Are any of the fans for the page particularly outspoken and participatory?
What are the most common topics being talked about on the page?
Your imagination is the only limitation to what you can ask of the Graph API for a Facebook page when you are mining its content for insights, and these questions should get you headed in the right direction. Along the way, we’ll also use these questions as the basis of some comparisons among other pages.
Recall that the starting point for our journey might have been a search for “social web” that revealed a book entitled Mining the Social Web per the following search result item:
{
 "category": "Book",
 "name": "Mining the Social Web",
 "id": "146803958708175"
}
For any of the items in the search results, we could use the ID as the basis of a graph query through get_object with an instance of facebook.GraphAPI. If you don’t have a numeric string ID handy, just use the page name (such as “MiningTheSocialWeb”) that appears in the URL bar of your browser when you visit the page. The code is a quick one-liner that produces the results shown in Example 2-3.
# Get an instance of Mining the Social Web.
# Using the page name also works if you know it,
# e.g. 'MiningTheSocialWeb' or 'CrossFit'

mtsw_id = '146803958708175'
pp(g.get_object(mtsw_id))
Sample output for the query reveals the data that backs the object’s Facebook page, as shown here:
{
 "category": "Book",
 "username": "MiningTheSocialWeb",
 "about": "Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social...",
 "talking_about_count": 22,
 "description": "Facebook, Twitter, and LinkedIn generate a tremendous ...",
 "company_overview": "Like It here on Facebook!\n\nFollow @SocialWebMining...",
 "release_date": "January 2011",
 "can_post": true,
 "cover": {
  "source": "https://sphotos-b.xx.fbcdn.net/...",
  "cover_id": 474206292634605,
  "offset_x": -41,
  "offset_y": 0
 },
 "mission": "Teaches you how to...\n\n* Get a straightforward synopsis of ...",
 "name": "Mining the Social Web",
 "founded": "January 2011",
 "website": "http://amzn.to/d1Ci8A",
 "link": "http://www.facebook.com/MiningTheSocialWeb",
 "likes": 911,
 "were_here_count": 0,
 "general_info": "Analyzing Data from Facebook, Twitter, LinkedIn, ...",
 "id": "146803958708175",
 "is_published": true
}
The interesting analytical results from the query response are the book’s talking_about_count and like_count. The like_count is a good indicator of the page’s overall popularity, so a reasonable response to the query “How popular is the page?” is that there are 911 Facebook fans for the page, and 22 of them have recently been engaging in discussion. Given that Mining the Social Web is a fairly niche technical book, this seems like a reasonable fan base.[5]
For any kind of popularity analysis, however, comparables are essential for understanding the broader context. There are a lot of ways to draw comparisons, but a couple of striking data points are that the book’s publisher, O’Reilly Media, has around 34,000 likes, and the Python programming language has around 80,000 likes. Thus, the popularity of Mining the Social Web is approaching 3% of the publisher’s entire fan base and just over 1% of the programming language’s fan base. Clearly, there is a lot of room for this book’s popularity to grow, even though it’s a niche topic.
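The arithmetic behind these comparisons is trivial, but making it explicit pays off once you start repeating it for other pages. A quick sketch, using the approximate like counts quoted above as stand-in values:

```python
# Approximate like counts quoted in the text
mtsw_likes = 911
oreilly_likes = 34000
python_likes = 80000

def pct_of(part, whole):
    # Express part as a percentage of whole
    return 100.0 * part / whole

pct_vs_publisher = pct_of(mtsw_likes, oreilly_likes)  # roughly 2.7
pct_vs_language = pct_of(mtsw_likes, python_likes)    # roughly 1.1
```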
Although another good comparison would have been to a niche book similar to Mining the Social Web, it isn’t easy to find a good apples-to-apples comparison by reviewing Facebook page data, because at the time of this writing it isn’t possible to search for pages and constrain the results by a category such as “book.” Instead, you’d have to search for pages and then filter the result set by category yourself to retrieve only the books. Still, there are a couple of options to consider.
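The client-side filtering just described takes only a few lines of Python. The following sketch filters a result list by category; the sample list is fabricated to mimic the shape of the search results shown earlier, and in a live session you would build it from a g.request('search', ...) call as in the other examples:

```python
def filter_pages_by_category(search_results, category):
    # Keep only the search result items whose category matches
    return [item for item in search_results
            if item.get('category') == category]

# A live query would look something like:
#   results = g.request('search', {'q': 'social web', 'type': 'page'})['data']
#   books = filter_pages_by_category(results, 'Book')

# Fabricated sample data shaped like Graph API search results
sample = [
    {'category': 'Book', 'name': 'Mining the Social Web', 'id': '146803958708175'},
    {'category': 'Community', 'name': 'Social Web', 'id': '000000000000000'},
]
books = filter_pages_by_category(sample, 'Book')
```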
One option is to search for another O’Reilly title that you know is a similar kind of niche book, such as Programming Collective Intelligence, and see what turns up. The Graph API search results for a query of “Programming Collective Intelligence” do turn up a community page with almost 400 likes; all things being equal, it’s interesting that a six-year-old book has almost half the number of likes as Mining the Social Web without an active author maintaining a page for it.
Another option to consider is taking advantage of concepts from Facebook’s Open Graph Protocol in order to draw a comparison. For example, the O’Reilly online catalog contains entries and implements OGP for all of O’Reilly’s titles, and there are pages (and thus Like buttons) for both Mining the Social Web, 2nd Edition and Programming Collective Intelligence. We can easily make requests to the Graph API to see what data is available and keep tabs on it by simply querying for these URLs in the browser as follows:
- Graph API query for Mining the Social Web
https://graph.facebook.com/http://shop.oreilly.com/product/0636920030195.do
- Graph API query for Programming Collective Intelligence
https://graph.facebook.com/http://shop.oreilly.com/product/9780596529321.do
In terms of a programmatic query with Python, the URLs are the objects that we are querying (just like the URL for the IMDb entry for The Rock was what we were querying earlier), so in code, we can query these objects as shown in Example 2-4. As a subtle but very important distinction, keep in mind that even though both the O’Reilly catalog page and the Facebook fan page for Mining the Social Web logically represent the same book, the nodes (and accompanying metadata, such as the number of likes) that correspond to the Facebook page versus the O’Reilly catalog page are completely independent. It just so happens that each represents the same real-world concept. Figure 2-6 demonstrates an exploration of the Graph API with IPython Notebook.
Note
An entirely separate kind of analysis known as entity resolution (or entity disambiguation, depending on how you frame the problem) is the process of aggregating mentions of things into a single platonic concept. For example, in this case, an entity resolution process could observe that there are multiple nodes in the Open Graph that actually refer to the same platonic idea of Mining the Social Web and create connections between them indicating that they are in fact equivalent as an entity in the real world. Entity resolution is an exciting field of research that will continue to have profound effects on how we use data as the future unfolds.
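As a toy illustration of the idea only (real entity resolution relies on far richer features than a normalized name), mentions can be grouped by a crude canonical form:

```python
import re
from collections import defaultdict

def normalize(name):
    # Crude canonical form: lowercase and strip non-alphanumerics
    return re.sub(r'[^a-z0-9]', '', name.lower())

def resolve_entities(mentions):
    # Group surface forms that share a normalized key
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)

# Two surface forms of the same platonic concept, plus one other
mentions = ['Mining the Social Web',
            'mining-the-social-web',
            'Programming Collective Intelligence']
groups = resolve_entities(mentions)
```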
# MTSW catalog link
pp(g.get_object('http://shop.oreilly.com/product/0636920030195.do'))

# PCI catalog link
pp(g.get_object('http://shop.oreilly.com/product/9780596529321.do'))
Although it’s often not the case that you’ll be able to make an apples-to-apples comparison that provides an authoritative result when data mining, there’s still a lot to be learned. Exploring a data set long enough to accumulate strong intuitions about the data often provides all the insight that you’ll initially need when encountering a problem space for the first time. Hopefully, enhancements to the Graph API as part of the new Graph Search product will facilitate more sophisticated queries and lower the barriers to entry for data miners in the future.
Analyzing Coke vs. Pepsi Facebook pages
As an alternative to analyzing tech books, let’s take just a moment to broaden the scope of the discussion to something much more mainstream and see what turns up. The never-ending soft drink war between Coke and Pepsi seems like an innocuous but potentially interesting topic to consider, so let’s set out to determine which one is the most popular according to Facebook. As you now know, the answer is just a couple of graph queries away, as illustrated in Example 2-5.
# Find Pepsi and Coke in search results

pp(g.request('search', {'q': 'pepsi', 'type': 'page', 'limit': 5}))
pp(g.request('search', {'q': 'coke', 'type': 'page', 'limit': 5}))

# Use the ids to query for likes

pepsi_id = '56381779049'  # Could also use 'PepsiUS'
coke_id = '40796308305'   # Could also use 'CocaCola'

# A quick way to format integers with commas every 3 digits
def int_format(n):
    return "{:,}".format(n)

print "Pepsi likes:", int_format(g.get_object(pepsi_id)['likes'])
print "Coke likes:", int_format(g.get_object(coke_id)['likes'])
The results are somewhat striking:
Pepsi likes: 9,677,881
Coke likes: 62,735,664
Would you have expected that Coke has almost seven times the popularity of Pepsi on Facebook? As one possible source of investigation, you might consult stock market information and see if the number of likes correlates at all with the overall market capitalization, which could be an indicator of the overall size of the companies. If you were to look up this information, however, the results might surprise you: at the time of this writing (circa March 2013), the market capitalization of Coke (NYSE:KO) is around 178B, whereas Pepsi (NYSE:PEP) is 121B. Although analysis of companies at a financial level is a very complex exercise in and of itself, the overall market capitalization of the companies differs by only around 30%, which is a far cry from a 700% difference in Facebook popularity. It seems reasonable to think that each company probably has similar means available to it and probably sells similar amounts of product.
A worthwhile exercise would be to drill down further and try to determine what might be the cause of this disparity. In approaching a question like this one, bear in mind that although there are likely to be indicators in the Facebook data itself, the overall scope is very broad, and there may be a number of dependent variables outside of what you might find in Facebook data. For example, are there particular indicators you can find in the data that suggest that Coke launches massive advertising campaigns or does anything special to engage users in a way that Pepsi does not?
Digging further into what now seems like a bit of a phenomenon is left as an exercise. However, here’s a hint to get you on your way: a shallow search of the Web reveals articles on reputable sites such as Forbes entitled “Coca-Cola and Procter and Gamble Lead the Way into the New Advertising Era of SocialTV... A Money Machine” and “Coca-Cola Leveraging Social to Drive Leadership in Social Media Marketing,” indicating that Coca-Cola has intentionally developed marketing campaigns that make extensive use of social media.
There are practically limitless possibilities for analyzing a Facebook page, and a great transition point after you’ve performed frequency analysis is to examine the human language data in the page’s feed, although more specific kinds of filters (such as the page’s shared links) can be queried for analysis as well. We can’t solve every problem in each short chapter, but once you’ve read up on some techniques for processing human language data in Chapters 4 and 5, you’ll be able to return to this problem and apply those techniques to better understand how Coca-Cola’s social media team engages its Facebook fans and to gain insight into the communication that takes place between them. Meanwhile, you could frame your intuition by spending a few minutes skimming Coca-Cola’s and Pepsi’s Facebook pages. After all, glossing over the data at a high level whenever possible is an essential prerequisite to programmatic analysis.
Example 2-6 provides a starting point for data collection if you’d like to examine the human language data from a page by treating its content as a bag of words with basic frequency analysis techniques, as explained in Chapter 4. In other words, you could just split the text into words by approximating word boundaries with whitespace and feed the words into a Counter to compute the most frequent terms as a starting point.
pp(g.get_connections(pepsi_id, 'feed'))
pp(g.get_connections(pepsi_id, 'links'))
pp(g.get_connections(coke_id, 'feed'))
pp(g.get_connections(coke_id, 'links'))
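The bag-of-words idea just described takes only a few lines once the feed data is in hand. The sketch below counts terms in the 'message' field of feed items; the sample posts are fabricated stand-ins for a real g.get_connections(coke_id, 'feed')['data'] response:

```python
from collections import Counter

def term_frequencies(posts):
    # Approximate word boundaries with whitespace and count terms
    words = []
    for post in posts:
        words += post.get('message', '').lower().split()
    return Counter(words)

# A live call would look something like:
#   posts = g.get_connections(coke_id, 'feed')['data']
sample_posts = [
    {'message': 'Open happiness with an ice cold Coke'},
    {'message': 'Share a Coke with friends'},
]
freqs = term_frequencies(sample_posts)
```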
If you choose to examine the human language data in the pages, here are a few questions for your consideration:
Can you determine which posts in the feed are the most popular, as indicated by either the number of comments posted or the number of likes?
Are any particular kinds of posts more popular than others? For example, are posts with links more popular than posts with photos?
What characteristics can you identify that make a post go viral as opposed to just getting a couple of likes?
Note
Example 2-6 demonstrates how to query for the page’s feed and links to get you started. The differences between feeds, posts, and statuses can initially be a bit confusing. In short, feeds include anything that users might see on their own wall, posts include most any content users have created and posted to their own or a friend’s wall, and statuses include only status updates posted on a user’s own wall. See the Graph API documentation for a user for more details.
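The first of these questions can be approached with a simple engagement score. The sketch below assumes feed items shaped like typical Graph API responses, where the nested likes and comments fields each carry a 'data' list; the sample feed is fabricated for illustration:

```python
def engagement(post):
    # Score a feed item by its number of likes plus comments.
    # Assumes the nested 'likes'/'comments' fields carry a 'data'
    # list, as in typical Graph API feed responses.
    n_likes = len(post.get('likes', {}).get('data', []))
    n_comments = len(post.get('comments', {}).get('data', []))
    return n_likes + n_comments

def rank_posts(posts):
    # Most engaging posts first
    return sorted(posts, key=engagement, reverse=True)

# Fabricated sample feed data
sample_feed = [
    {'id': '1', 'likes': {'data': [{}, {}]}, 'comments': {'data': [{}]}},
    {'id': '2', 'likes': {'data': [{}]}},
]
ranked = rank_posts(sample_feed)
```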
Examining Friendships
Let’s now use our knowledge of the Graph API to examine the friendships from your own social network. Here are some questions to get the creative juices flowing:
Are there any topics or special interests that are especially pronounced within your social network?
Does your social network contain many mutual friendships or even larger cliques?
How well connected are the people in your social network?
Are any of your friends particularly outspoken or passionate about anything you might also be interested in learning more about?
The remainder of this section walks through exercises that involve analyzing likes as well as analyzing and visualizing mutual friendships. Although we are framing this section in terms of your social network, bear in mind that the conversation generalizes to any other user’s account and could be realized through a Facebook application you could create and make available.
Analyzing things your friends “like”
Let’s set out to examine whether any topics or special interests exist within your social network and explore from there. A logical starting point is to aggregate the likes for each of your friends and determine whether any particularly high-frequency items appear. Example 2-7 demonstrates how to build a frequency distribution of the likes in your social network as the basis for further analysis. Keep in mind that some of your friends may have privacy settings that prevent sharing certain types of personal information, such as their likes, with apps; in those cases, you’ll often see empty results rather than any kind of explicit error message.
# First, let's query for all of the likes in your social
# network and store them in a slightly more convenient
# data structure as a dictionary keyed on each friend's
# name. We'll use a dictionary comprehension to iterate
# over the friends and build up the likes in an intuitive
# way, although the new "field expansion" feature could
# technically do the job in one fell swoop as follows:
#
# g.get_object('me', fields='id,name,friends.fields(id,name,likes)')
#
# See Appendix C for more information on Python tips such as
# dictionary comprehensions

friends = g.get_connections("me", "friends")['data']

likes = { friend['name'] : g.get_connections(friend['id'], "likes")['data']
          for friend in friends }

print likes
Note
Reducing the scope of the expected data tends to speed up the response. If you have a lot of Facebook friends, the previous query may take some time to execute. Consider trying out the option to use field expansion and make a single query, or try limiting results with a list slice such as friends[:100] to limit the scope of analysis to 100 of your friends while you are initially exploring the data.
There’s nothing particularly tricky about collecting your friends’ likes and building up a nice data structure, although this might be one of your first encounters with a dictionary comprehension. Just like a list comprehension, a dictionary comprehension iterates over a list of items and collects values (key/value pairs in this case) that are to be returned. You may also want to try out the Graph API’s new field expansion feature and issue a single query for all of your friends’ likes in a single request. With the facebook package, you could do it like this: g.get_object('me', fields='id,name,friends.fields(id,name,likes)').
Note
See Appendix C for more information on dictionary comprehensions and other Python tips and tricks.
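If dictionary comprehensions are new to you, here is a minimal, self-contained illustration of the construct (the data is made up):

```python
# Map each name to its length, just as the likes dictionary maps
# each friend's name to that friend's list of likes
names = ['Joshua', 'Derek', 'Heather']
name_lengths = {name: len(name) for name in names}
# name_lengths == {'Joshua': 6, 'Derek': 5, 'Heather': 7}
```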
With a useful data structure called likes in hand that contains your friends and their likes, let’s start off our analysis by calculating the most popular likes across all of your friends. The Counter class provides an easy way to build a frequency distribution that will do just the trick, as illustrated in Example 2-8, and we can use the prettytable package (pip install prettytable if you don’t have it already) to neatly format the results so that they’re more readable.
# Analyze all likes from friendships for frequency
# pip install prettytable

from prettytable import PrettyTable
from collections import Counter

friends_likes = Counter([like['name']
                         for friend in likes
                         for like in likes[friend]
                         if like.get('name')])

pt = PrettyTable(field_names=['Name', 'Freq'])
pt.align['Name'], pt.align['Freq'] = 'l', 'r'
[ pt.add_row(fl) for fl in friends_likes.most_common(10) ]

print 'Top 10 likes amongst friends'
print pt
Sample results follow:
Top 10 likes amongst friends
+-------------------------+------+
| Name                    | Freq |
+-------------------------+------+
| Crossfit Cool Springs   |   14 |
| CrossFit                |   13 |
| The Pittsburgh Steelers |   13 |
| Working Out             |   13 |
| The Bible               |   13 |
| Skiing                  |   12 |
| Star Trek               |   12 |
| Seinfeld                |   12 |
| Jesus                   |   12 |
+-------------------------+------+
It appears that exercise/sports is a common theme within this social network, with religion/Christianity possibly being a common theme as well. Let’s dig a little bit further and analyze the categories of likes that exist within the social network to see if the same themes exist. Example 2-9 illustrates a variation of the previous example that shows how.
# Analyze all like categories by frequency

friends_likes_categories = Counter([like['category']
                                    for friend in likes
                                    for like in likes[friend]])

pt = PrettyTable(field_names=['Category', 'Freq'])
pt.align['Category'], pt.align['Freq'] = 'l', 'r'
[ pt.add_row(flc) for flc in friends_likes_categories.most_common(10) ]

print "Top 10 like categories for friends"
print pt
Sample results from the query follow, in the same tabular structure as before:
Top 10 like categories for friends
+-------------------------+------+
| Category                | Freq |
+-------------------------+------+
| Musician/band           |   62 |
| Book                    |   46 |
| Movie                   |   43 |
| Interest                |   40 |
| Tv show                 |   31 |
| Public figure           |   31 |
| Local business          |   25 |
| Community               |   24 |
| Non-profit organization |   21 |
| Product/service         |   17 |
+-------------------------+------+
There are no explicit mentions of sports or religion, but it is interesting how much higher the frequencies are on some of the “unexpected” categories such as “Musician/band” or “Book.” It may be that there are simply a lot of highly eclectic, nonoverlapping interests within the social network.
Something that may shed further light on the situation and be compelling in and of itself is to calculate how many likes exist for each friend. For example, do most friends have a similar number of likes, or is the number of likes highly skewed? Having additional insight into the underlying distribution helps to inform some of the things that may be happening when the data is aggregated. In Example 2-10, we’ll calculate a frequency distribution that shows the number of likes for each friend to get an idea of how the categories from the previous example may be skewed.
Note
Example 2-10 introduces the operator.itemgetter function, which is commonly used in combination with the sorted function to sort a list of tuples (as returned from calling items() on an instance of a dictionary) based upon a particular slot in the tuple. For example, passing key=itemgetter(1) to the sorted function returns a sorted list that uses the second item in each tuple as the basis of sorting. See Appendix C for more details.
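As a quick, self-contained illustration of the pattern (the tallies here are invented):

```python
from operator import itemgetter

# Sort (name, count) tuples by the count (index 1) in each tuple
tallies = {'Joshua': 187, 'Bryan': 38, 'Heather': 84}
ranked = sorted(tallies.items(), key=itemgetter(1), reverse=True)
# ranked == [('Joshua', 187), ('Heather', 84), ('Bryan', 38)]
```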
# Build a frequency distribution of number of likes by
# friend with a dictionary comprehension and sort it in
# descending order

from operator import itemgetter

num_likes_by_friend = { friend : len(likes[friend])
                        for friend in likes }

pt = PrettyTable(field_names=['Friend', 'Num Likes'])
pt.align['Friend'], pt.align['Num Likes'] = 'l', 'r'
[ pt.add_row(nlbf)
  for nlbf in sorted(num_likes_by_friend.items(),
                     key=itemgetter(1),
                     reverse=True) ]

print "Number of likes per friend"
print pt
Sample results have the familiar form of a tuple with a friend and frequency value. Some results (sanitized of last names) follow:
Number of likes per friend
+--------------------+-----------+
| Friend             | Num Likes |
+--------------------+-----------+
| Joshua             |       187 |
| Derek              |       146 |
| Heather            |        84 |
| Rick               |        69 |
| Patrick            |        42 |
| Bryan              |        38 |
| Ray                |        17 |
| Jamie              |        14 |
| ...                |       ... |
| Bas                |         0 |
+--------------------+-----------+
The more time you spend really trying to understand the data, the more insights you’ll glean, and by now I hope you are starting to get a more holistic picture of what’s happening. We now know that the distribution of likes across the data is enormously skewed across a small number of friends and that any one friend’s results could be highly contributing to the results that break down the frequencies of category for each like. There are a number of directions that we could go in at this point. One possibility would be to start to compare smaller samples of friends for some kind of similarity or to further analyze likes. For example, does Joshua account for 90% of the liked TV shows? Does Derek account for the most significant majority of liked music? The answers to these questions are well within your grasp at this point.
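One lightweight way to quantify the skew just described is to compare the mean and median of the per-friend counts; when a few friends dominate, the mean is pulled well above the median. A sketch using the sample counts from the table above:

```python
def mean(values):
    return float(sum(values)) / len(values)

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0

# Per-friend like counts from the sample table above
counts = [187, 146, 84, 69, 42, 38, 17, 14, 0]

# A mean well above the median signals a right-skewed distribution
skewed = mean(counts) > median(counts)
```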
Instead, however, let’s ask another question: which friends are most similar to the ego[6] in the social network? To make any kind of useful similarity comparison between two things, we’ll need a similarity function. The simplest possibility is likely to be one of the best starting points, so let’s start out with “number of shared likes” to compute similarity between the ego and friendships. All we need is the ego’s likes and some help from the set object’s intersection operator, which makes it possible to compare two lists of items and compute the overlapping items from each of them. Example 2-11 illustrates how to compute the overlapping likes between the ego and friendships in the network as the first step in finding the most similar friends in the network.
# Which of your likes are in common with which friends?

my_likes = [ like['name']
             for like in g.get_connections("me", "likes")['data'] ]

pt = PrettyTable(field_names=["Name"])
pt.align = 'l'
[ pt.add_row((ml,)) for ml in my_likes ]
print "My likes"
print pt

# Use the set intersection as represented by the ampersand
# operator to find common likes.

common_likes = list(set(my_likes) & set(friends_likes))

pt = PrettyTable(field_names=["Name"])
pt.align = 'l'
[ pt.add_row((cl,)) for cl in common_likes ]
print "My common likes with friends"
print pt
Here’s the abbreviated output containing the results of the overlapping likes that are in common for this social network:
My likes
+-------------------------------+
| Name                          |
+-------------------------------+
| Snatch (weightlifting)        |
| First Blood                   |
| Robinson Crusoe               |
| The Godfather                 |
| The Godfather                 |
| ...                           |
| The Art of Manliness          |
| USA Triathlon                 |
| CrossFit                      |
| Mining the Social Web         |
+-------------------------------+

My common likes with friends
+-------------------------------+
| Name                          |
+-------------------------------+
| www.SEALFIT.com               |
| Rich Froning Jr. Fan Site     |
| CrossFit                      |
| The Great Courses             |
| The Art of Manliness          |
| Dan Carlin - Hardcore History |
| Mining the Social Web         |
| Crossfit Cool Springs         |
+-------------------------------+
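The raw overlap count used here favors friends who simply like many things. A common refinement (not part of the book’s example, but easy to add) is Jaccard similarity, the size of the intersection divided by the size of the union; the sample likes below are illustrative:

```python
def jaccard(a, b):
    # |intersection| / |union| of two collections
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0
    return float(len(a & b)) / len(a | b)

mine = ['CrossFit', 'The Godfather', 'Mining the Social Web']
theirs = ['CrossFit', 'Mining the Social Web', 'Star Trek']
score = jaccard(mine, theirs)  # 2 shared out of 4 distinct likes
```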
Coming back full circle, it’s perhaps not too surprising that the common theme of sports/exercise once again emerges (but with additional detail this time), as do some topics related to Christianity.[7] There are many more engaging questions to ask (and answer), but let’s wrap up this section by completing the second half of this query, which is to find the particular friends that share the common interests with the ego in the network. Example 2-12 shows how to do this by iterating over the friendships with a double list comprehension and processing the results. It also reminds us that we have full access to the plotting capabilities from matplotlib that were introduced in Visualizing Frequency Data with Histograms.
Note
If you are using the virtual machine, your IPython Notebooks should be configured to use plotting capabilities out of the box. If you are running in your own local environment, be sure to have started IPython Notebook with PyLab enabled as follows: ipython notebook --pylab=inline.
# Which of your friends like things that you like?

similar_friends = [ (friend, friend_like['name'])
                    for friend, friend_likes in likes.items()
                    for friend_like in friend_likes
                    if friend_like.get('name') in common_likes ]

# Filter out any possible duplicates that could occur

ranked_friends = Counter([ friend for (friend, like) in list(set(similar_friends)) ])

pt = PrettyTable(field_names=["Friend", "Common Likes"])
pt.align["Friend"], pt.align["Common Likes"] = 'l', 'r'
[ pt.add_row(rf)
  for rf in sorted(ranked_friends.items(),
                   key=itemgetter(1),
                   reverse=True) ]

print "My similar friends (ranked)"
print pt

# Also keep in mind that you have the full range of plotting
# capabilities available to you. A quick histogram of the number
# of shared likes per friend follows.

plt.hist(ranked_friends.values())
plt.xlabel('Bins (number of shared likes)')
plt.ylabel('Number of friends in each bin')

plt.figure()  # Display the previous plot

# Keep in mind that you can customize the binning
# as desired. See http://matplotlib.org/api/pyplot_api.html
# For example...

plt.hist(ranked_friends.values(),
         bins=arange(1, max(ranked_friends.values()), 1))
plt.xlabel('Bins (number of shared likes)')
plt.ylabel('Number of friends in each bin')

plt.figure()  # Display the working plot
By now, you should be familiar with the processing. We’ve simply iterated over the variables we’ve built up so far to build a list of expanded tuples of the form (friend, friend’s like) and then used it to compute a frequency distribution to determine which friends have the most common likes. Sample results for this query in tabular form follow, and Figure 2-7 displays the same results as a histogram:
My similar friends (ranked)
+----------+--------------+
| Friend   | Common Likes |
+----------+--------------+
| Derek    |            7 |
| Jamie    |            4 |
| Joshua   |            3 |
| Heather  |            3 |
| ...      |          ... |
| Patrick  |            1 |
+----------+--------------+
As you are probably thinking, there is an abundance of questions that can be investigated with just a small sliver of data from your Facebook friends. We’ve just scratched the surface, but hopefully these exercises have been helpful in terms of framing some good starting points that can be further explored. It doesn’t take much imagination to continue down this road or to pick up with a different angle and start down an entirely different one. To illustrate just one possibility, let’s take just a moment to check out a nifty way to visualize some of your Facebook friends’ data that’s along a different line of thinking before closing out this chapter.
Analyzing mutual friendships with directed graphs
Unlike Twitter, which is an inherently open network in which you can crawl “friendships” over an extended period of time and build a large graph for any given starting point, Facebook data is much richer and rife with personally identifiable and sensitive properties about people, so the privacy and access controls make it much more closed. While you can use the Graph API to access data for the authenticating user and the authenticating user’s friends, you cannot access data for arbitrary users beyond those boundaries unless it is exposed as publicly available. One Graph API operation of particular interest is the ability to get the mutual friendships (available through the mutualfriends API and documented as part of the User object) that exist within your social network (or the social network of the authenticating user). (In other words, which of your friends are also friends with one another?) From a graph analytics perspective, analysis of an ego graph for mutual friendships can very naturally be formulated as a clique detection problem.
For example, if Abe is friends with Bob, Carol, and Dale, and Bob and Carol are also friends, the largest (“maximum”) clique in the graph exists among Abe, Bob, and Carol. If Abe, Bob, Carol, and Dale were all mutual friends, however, the graph would be fully connected, and the maximum clique would be of size 4. Adding nodes to the graph might create additional cliques, but it would not necessarily affect the size of the maximum clique in the graph. In the context of the social web, the maximum clique is interesting because it indicates the largest set of common friendships in the graph. Given two social networks, comparing the sizes of the maximum friendship cliques might provide a good starting point for analysis about various aspects of group dynamics, such as teamwork, trust, and productivity. Figure 2-8 illustrates a sample graph with the maximum clique highlighted. This graph would be said to have a clique number of size 4.
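The Abe/Bob/Carol/Dale example can be checked by brute force in pure Python. This sketch is for intuition only, since the brute-force approach is itself exponential; NetworkX’s find_cliques, used shortly, is the practical tool:

```python
from itertools import combinations

def is_clique(nodes, edges):
    # Every pair of nodes must be directly connected
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))

def max_clique(nodes, edges):
    # Brute force: try candidate sets from largest to smallest
    for size in range(len(nodes), 0, -1):
        for candidate in combinations(sorted(nodes), size):
            if is_clique(candidate, edges):
                return list(candidate)
    return []

nodes = {'Abe', 'Bob', 'Carol', 'Dale'}
edges = {frozenset(e) for e in [('Abe', 'Bob'), ('Abe', 'Carol'),
                                ('Abe', 'Dale'), ('Bob', 'Carol')]}

clique = max_clique(nodes, edges)  # Abe, Bob, and Carol

# Making everyone mutual friends grows the maximum clique to size 4
edges |= {frozenset(('Bob', 'Dale')), frozenset(('Carol', 'Dale'))}
full = max_clique(nodes, edges)
```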
Note
Technically speaking, there is a subtle difference between a maximal clique and a maximum clique. The maximum clique is the largest clique in the graph (or cliques in the graph, if they have the same size). A maximal clique, on the other hand, is one that is not a subgraph of another clique. Figure 2-8, for example, illustrates a maximum clique of size 4, but there are several other maximal cliques of size 3 in the graph as well.
Finding cliques is an NP-complete problem (implying an exponential runtime), but there is an amazing Python package called NetworkX (pronounced either “networks” or “network x”) that provides extensive graph analytics functionality, including a find_cliques method that delivers a solid implementation of this difficult problem. Just be advised that it might take a long time to run as graphs get beyond a reasonably small size (hence, the aforementioned exponential runtime). Examples 2-13 and 2-14 demonstrate how to use Facebook data to construct a graph of mutual friendships and then use NetworkX to analyze the cliques within the graph. You can install NetworkX with the predictable pip install networkx from a terminal.
import networkx as nx # pip install networkx
import requests # pip install requests
import json # used below to parse API responses

friends = [(friend['id'], friend['name'],)
           for friend in g.get_connections('me', 'friends')['data']]

url = 'https://graph.facebook.com/me/mutualfriends/%s?access_token=%s'

mutual_friends = {}

# This loop spawns a separate request for each iteration, so
# it may take a while. Optimization with a thread pool or similar
# technique would be possible.
for friend_id, friend_name in friends:
    r = requests.get(url % (friend_id, ACCESS_TOKEN,))
    response_data = json.loads(r.content)['data']
    mutual_friends[friend_name] = [data['name']
                                   for data in response_data]

nxg = nx.Graph()

[nxg.add_edge('me', mf) for mf in mutual_friends]

[nxg.add_edge(f1, f2) for f1 in mutual_friends
                      for f2 in mutual_friends[f1]]

# Explore what's possible to do with the graph by
# typing nxg.<tab> or executing a new cell with
# the following value in it to see some pydoc on nxg
nxg
# Finding cliques is a hard problem, so this could
# take a while for large graphs.
# See http://en.wikipedia.org/wiki/NP-complete and
# http://en.wikipedia.org/wiki/Clique_problem.

cliques = [c for c in nx.find_cliques(nxg)]

num_cliques = len(cliques)

clique_sizes = [len(c) for c in cliques]
max_clique_size = max(clique_sizes)
avg_clique_size = sum(clique_sizes) / num_cliques

max_cliques = [c for c in cliques if len(c) == max_clique_size]

num_max_cliques = len(max_cliques)

max_clique_sets = [set(c) for c in max_cliques]
friends_in_all_max_cliques = list(reduce(lambda x, y: x.intersection(y),
                                         max_clique_sets))

print 'Num cliques:', num_cliques
print 'Avg clique size:', avg_clique_size
print 'Max clique size:', max_clique_size
print 'Num max cliques:', num_max_cliques
print 'Friends in all max cliques:'
print json.dumps(friends_in_all_max_cliques, indent=1)
print 'Max cliques:'
print json.dumps(max_cliques, indent=1)
Sample output for Example 2-14 follows and illustrates that there are four cliques of size 4, with the ego (“me”) and one other person common to all of them. Although that other person is not guaranteed to be the second most highly connected person in the network, this person is likely to be among the most influential because of the relationships in common:
Num cliques: 6
Avg clique size: 3
Max clique size: 4
Num max cliques: 4
Friends in all max cliques:
[
 "me",
 "Bas"
]
Max cliques:
[
 [
  "me",
  "Bas",
  "Joshua",
  "Heather"
 ],
 [
  "me",
  "Bas",
  "Ray",
  "Patrick"
 ],
 [
  "me",
  "Bas",
  "Ray",
  "Rick"
 ],
 [
  "me",
  "Bas",
  "Jamie",
  "Heather"
 ]
]
Example 2-14 could be modified in any number of ways, and clearly, there’s much more we could do than just detect the cliques. Plotting the locations of people involved in cliques on a map to see whether there’s any correlation between tightly connected networks of people and geographic locale and analyzing information in their profile data and the content in their posts might be a couple of good starting points. In the next section, we’ll learn how to put together a concise but effective visualization of mutual friendships in an intuitive graphical format.
Visualizing directed graphs of mutual friendships
D3.js is a truly state-of-the-art JavaScript toolkit that can render some beautiful visualizations in the browser with an intuitive approach that involves manipulating objects with a series of data-driven transformations. If you haven’t already encountered D3, then you really should take a few moments to browse the example gallery to get a feel for what is possible. You will be impressed.
A tutorial on how to use D3 is well outside the scope of this book, and there are numerous tutorials and discussions online about how to use many of its exciting visualizations. What we’ll do in this section before rounding out the chapter is render an interactive visualization for the mutual friendship graph introduced in the previous section. Abstractly, a graph is just a mathematical construct and doesn’t have a visual representation, but a number of layout algorithms are available that can render the graph in two-dimensional space so that it displays rather nicely (although you may need to tweak some of the layout parameters from time to time to get things just right).
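To make the idea of a layout algorithm concrete, NetworkX itself ships with a few of them. The sketch below runs its force-directed spring_layout (similar in spirit to D3's force layout) on a tiny friendship graph with made-up names:

```python
import networkx as nx  # pip install networkx

# A tiny made-up mutual-friendship graph
g = nx.Graph([('me', 'Bas'), ('me', 'Ray'), ('Bas', 'Ray'), ('me', 'Rick')])

# A force-directed layout assigns each node an (x, y) position in 2D
# space; fixing the seed makes the otherwise random layout reproducible
pos = nx.spring_layout(g, seed=42)

assert set(pos) == set(g.nodes())          # one position per node
assert all(len(xy) == 2 for xy in pos.values())
```

Tools like D3 then draw the nodes at those computed positions and animate the edges between them.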
NetworkX can emit a format that is directly consumable by D3, and very little work is necessary to visualize the graph, since IPython Notebook can serve and render local content in an inline frame once you prepend the special files prefix to the path. Example 2-15 demonstrates how to serialize out the graph for rendering, and Example 2-16 uses IPython Notebook to serve up a web page displaying an interactive graph like the one shown in Figure 2-9. The HTML that embeds the necessary style and scripts is included with the IPython Notebook for this chapter in a subfolder of its resources called viz.
Note
You can access files for reading or writing with IPython Notebook by using relative or absolute paths; however, serving files as web pages requires you to prepend the special files prefix to the path.
from networkx.readwrite import json_graph

nld = json_graph.node_link_data(nxg)

json.dump(nld, open('resources/ch02-facebook/viz/force.json', 'w'))
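If you're curious what node_link_data actually emits, the minimal sketch below (using a two-node stand-in graph rather than your real friendship graph) shows the dictionary shape that gets serialized for D3:

```python
import json
import networkx as nx  # pip install networkx
from networkx.readwrite import json_graph

g = nx.Graph()
g.add_edge('me', 'Bas')

nld = json_graph.node_link_data(g)

# The result is a plain dict containing the node list and the edge
# list (keyed 'links' historically, 'edges' in newer NetworkX
# releases), which serializes directly to the JSON the D3 page loads
assert 'nodes' in nld and ('links' in nld or 'edges' in nld)
serialized = json.dumps(nld, indent=1)
```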
from IPython.display import IFrame
from IPython.core.display import display

# IPython Notebook can serve files and display them into
# inline frames. Prepend the path with the 'files' prefix.
viz_file = 'files/resources/ch02-facebook/viz/force.html'

display(IFrame(viz_file, '100%', '600px'))
Closing Remarks
The goal of this chapter was to teach you about the Graph API, how the Open Graph protocol can create connections between arbitrary web pages and Facebook’s Social Graph, and how to programmatically query the Social Graph to gain insight into Facebook pages and your own social network. If you’ve worked through the examples in this chapter, you should have little to no trouble probing the Social Graph for answers to questions that may prove valuable. Keep in mind that as you explore a data set as enormous and interesting as Facebook’s Social Graph, you really just need a good starting point. As you investigate answers to an initial query, you’ll likely follow a natural course of exploration that will successively refine your understanding of the data and get you closer to the answers that you are looking for.
The possibilities for mining data on Facebook are immense, but be respectful of privacy, and always comply with Facebook’s terms of service to the best of your ability. Unlike data from Twitter and some other sources that are inherently more open in nature, Facebook data can be quite sensitive, especially if you are analyzing your own social network. Hopefully, this chapter has made it apparent that there are many exciting possibilities for what can be done with social data, and that there’s enormous value tucked away on Facebook.
Note
The source code outlined for this chapter and all other chapters is available at GitHub in a convenient IPython Notebook format that you’re highly encouraged to try out from the comfort of your own web browser.
Recommended Exercises
Analyze data from the fan page for something you’re interested in on Facebook and attempt to analyze the natural language in the comments stream to gain insights. What are the most common topics being discussed? Can you tell if fans are particularly happy or upset about anything?
Select two different fan pages that are similar in nature and compare/contrast them. For example, what similarities and differences can you identify between fans of Chipotle Mexican Grill and Taco Bell? Can you find anything surprising?
Analyze your own friendships and try to determine if your own network has any natural rallying points or common interests. What is the common glue that binds your network together?
The number of Facebook objects available to the Graph API is enormous. Can you examine objects such as photos or checkins to discover insights about anyone in your network? For example, who posts the most pictures, and can you tell what they are about based on the comments stream? Where do your friends check in most often?
Use histograms (introduced in Visualizing Frequency Data with Histograms) to further slice and dice your friends’ likes data.
Use the Graph API to collect other kinds of data and find a suitable D3 visualization for rendering it. For example, can you plot where your friends live or where they grew up on a map? Which of your friends still live in their hometowns?
Harvest some Twitter data, construct a graph, and analyze/visualize it using the techniques introduced in this chapter.
Try out some different similarity metrics to compute your most similar friendships. The Jaccard Index is a good starting point. See Analyzing Bigrams in Human Language for some information that may be helpful here.
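As a nudge toward that last exercise, the Jaccard Index of two friends’ mutual-friend lists is just the size of the intersection divided by the size of the union. A minimal sketch, with made-up names standing in for real mutual-friend data:

```python
def jaccard_index(x, y):
    """Jaccard Index: |intersection| / |union| of two sets."""
    x, y = set(x), set(y)
    if not (x or y):
        return 0.0  # two empty sets: define similarity as 0
    return len(x & y) / float(len(x | y))

# Hypothetical mutual-friend lists for two of your friends
a = ['Bas', 'Ray', 'Heather']
b = ['Bas', 'Ray', 'Patrick', 'Rick']

similarity = jaccard_index(a, b)  # 2 shared names out of 5 distinct
assert similarity == 0.4
```

Ranking all pairs of friends by this score is one quick way to surface your “most similar” friendships.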
[2] Internet usage statistics show that the world’s population is estimated to be approximately 7 billion, with the estimated number of Internet users being almost 2.5 billion.
[3] Throughout this section describing the implementation of OGP, the term Social Graph is generically used to refer to both the Social Graph and Open Graph, unless explicitly emphasized otherwise.
[4] Throughout this section, keep in mind that the responses to these queries reflect data for the first edition of the book, since that’s what’s available at the time of this writing. Your exact query results may vary somewhat.
[5] This summary was generated circa March 2013.
[6] Remember, the ego of a social network is its logical center or basis. In this case, the ego of the network is the author of this book—that is, the person whose social network we are examining.
[7] Rich Froning Jr. is a well-known (and outspokenly Christian) CrossFit athlete.