O'Reilly Hacks
oreilly.comO'Reilly NetworkSafari BookshelfConferences Sign In/My Account | View Cart   
Book List Learning Lab PDFs O'Reilly Gear Newsletters Press Room Jobs  



Mining Popular Books and Movies in Social Networks
I describe how I generated a list of most cited books and movies from the profiles of the people who are connected to me at Orkut, Google's online social-networking service.

Contributed by:
Upster
[05/11/05 | Discuss (0) | Link to this hack]

Motivation

People generally mention only the movies and books they had enjoyed most in their profiles. So for a given set of people, a book that is mentioned the most by people in their profiles would definitly be more worth reading than the one that is mentioned comparatively lesser number of times.

So what did I do?

Orkut does not provide any Web-Services at this time that allow for writing such a tool, so I wrote a script in Perl that used the WWW::Mechanize module available from CPAN and parsed the HTML of the webpages that contained the required data.

The algorithm to do this is quite simple. You, your friends, friends of your friends, and so-on, can be arranged in a directed graph structure such that the vertices of the graph correspond to the users while the edges correspond to the function has-friend. So, if B is a friend of A, simply draw a directed edge from A to B.

The problem now is to simply visit all the vertices in the graph and build a hash of movie-names and the citation counts by processing the profile page of the user corresponding to the vertex currently being processed.

But hey, there is a slight problem!

Orkut on its home-page tells me that:

You are connected to 5283820 people through 21 friends.

Assuming that I am able to fetch and analyze the profile of each person in 1.5 seconds (Slow Network IO), it would take a total of about 92 days to analyze all the persons that I am connected to at Orkut. 92 days!

Hence, I would want to start at me in the graph and be able to restrict the analysis to people who are at a maximum distance max_distance from me. Additionaly, I should be able to provide max_distance to the algorithm.

The Algorithm

We can visit all the vertices in the graph by adopting the depth-first-retrieval technique starting at me in the graph, but with a slight modification. We would like to stop the DFS if the distance of the current vertex (user) in the graph is greater than or equal to max_distance (I am at a distance of zero from myself). Hence, we can have a recursive algorithm that would look something like:


sub find_movies {
    my $uid              = shift;
    my $current_distance = shift;
    my $max_distance     = shift;

    process_user($uid);
    $visited{$uid} = 1;

    # Process friends of current-user, only if his distance is less
    # than the maximum allowed distance from the start-user.

    if ($current_distance < $max_distance) {
        foreach (find_friends_of($uid)) {
            unless ($visited{$_}) {
                find_movies($_, $current_distance + 1, $max_distance);
            }
        }
    }
}

The process_user function processes the profile page (Profile.aspx) corresponding to the specified user and then updates the hash of movies. The find_friends_of function process the network of friends mentioned at Orkut (FriendsNet.aspx) and returns a list containing the identifiers of the users that are listed as friends of the specified user at Orkut.

find_movies will be initially invoked as find_movies($your_user_id, 0, $max_distance) and the visited hash would be initially empty. After the find_movies function call returns, we will have to sort the values in the movies hash that would contain all the movies mapped to their citation counts and print them out.

Results

Here is a list of most popular books, along with their citation counts, among the people who are at a maximum distance of 2 from me on Orkut.

  1. Fountain head (21)
  2. Da vinci code (16)
  3. The godfather (15)
  4. Catch 22 (13)
  5. Harry potter series (12)
  6. To kill a mocking bird (11)
  7. The alchemist (11)
  8. Lord of the rings (11)
  9. Gone with the wind (9)
  10. Catcher in the rye (9)
  11. Lotr (7)
  12. Five point someone (7)
  13. Love story (6)
  14. The class (5)
  15. Atlas shrugged (5)
  16. Midnight's children (4)
  17. Kane and abel (4)
  18. Siddhartha (3)
  19. Made in japan (3)
  20. Inscrutable americans (3)
  21. God of small things (3)
  22. Freedom at midnight (3)
  23. Doctors (3)


Cloud of Books

Here is the same list of books arranged in a cloud like fashion such that the more popular books appear larger than others. Each of the names in the cloud is a link and would start an Amazon search for that name.

atlas shrugged catch 22 catcher in the rye da vinci code doctors five point someone fountain head freedom at midnight god of small things gone with the wind harry potter series inscrutable americans kane and abel lord of the rings lotr love story made in japan midnight's children siddhartha the alchemist the class the godfather to kill a mocking bird 



Conclusion

There is a lot of precious information that can be discovered by aggregating the information in the profiles of users. This is just one example. The results were very satisfactory, and I strongly believe that there is a whole wealth of information that can be discovered by aggregating the user-data available in social-networks. Perhaps, Google should consider providing statistics to each user in Orkut, about the most popular books and movies among his/her friends, and link each of them to Amazon, or another shopping website that exposes Web-Services.

See also: The original article is available at http://upster.blogspot.com/2005/04/5-must-watch-movies-thanks-orkut.html


The Cloud of Books was originally presented here.


O'Reilly Home | Privacy Policy

© 2007 O'Reilly Media, Inc.
Website: | Customer Service: | Book issues:

All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.