Probabilistic data structures in Python

Use approximations with error bounds to trade accuracy for system resources, e.g., memory or compute time -- especially for large-scale analytics and streaming data.

By Paco Nathan
March 21, 2016

Probabilistic Data Structures

Probabilistic data structures represent a relatively new area of algorithms.
Notably, the mathematician Philippe Flajolet is credited with much of the early work, though a number of different people contributed to the algorithms shown here.

You may also hear the terms approximation algorithms, sketch algorithms, or online algorithms used to describe the same body of work.
These are proving useful for analytics with large-scale data and streaming applications, and especially for use cases that have both.

The big idea here is that these approaches provide approximations with error bounds.
The amount of acceptable error is traded off against system resources, such as memory or compute time.
Parameters adjust those error bounds, and also determine the resources required for a given application.
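
To make that trade-off concrete, here is a rough back-of-the-envelope sketch based on the standard HyperLogLog error relationship, where the relative error is roughly 1.04 divided by the square root of the number of registers. The one-byte-per-register figure is an assumption for illustration; actual implementations vary.

import math

def hll_registers (rel_error):
    # standard HyperLogLog relationship: error ~= 1.04 / sqrt(m),
    # so the number of registers m grows with the inverse square of the error
    return int(math.ceil((1.04 / rel_error) ** 2))

for err in [0.1, 0.01, 0.001]:
    m = hll_registers(err)
    # assume roughly one byte per register -- implementations vary
    print "error %0.3f => %d registers, ~%d KB" % (err, m, m / 1024)

Tighter error bounds cost memory quadratically, which is why you generally pick the loosest bound your application can tolerate.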

This notebook shows Python sample code for a few of the more well-known examples of probabilistic data structures, including:

HyperLogLog for set cardinality
BloomFilter for set membership
MinHash for set similarity
Count-Min Sketch for frequency summaries
t-Digest for streaming quantiles

Data Sets

We’ll need some interesting data to help illustrate how these algorithms work.
Let’s create a data set from the text of Jabberwocky by Lewis Carroll.

jabber_text = """
`Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe:
All mimsy were the borogoves,
  And the mome raths outgrabe.

"Beware the Jabberwock, my son!
  The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
  The frumious Bandersnatch!"
He took his vorpal sword in hand:
  Long time the manxome foe he sought --
So rested he by the Tumtum tree,
  And stood awhile in thought.
And, as in uffish thought he stood,
  The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
  And burbled as it came!
One, two! One, two! And through and through
  The vorpal blade went snicker-snack!
He left it dead, and with its head
  He went galumphing back.
"And, has thou slain the Jabberwock?
  Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!'
  He chortled in his joy.

`Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe;
All mimsy were the borogoves,
  And the mome raths outgrabe.
"""

For comparison, we’ll use text from another poem, Daylight Saving by Dorothy Parker.

parker_text = """
My answers are inadequate
To those demanding day and date
And ever set a tiny shock
Through strangers asking what's o'clock;
Whose days are spent in whittling rhyme-
What's time to her, or she to Time?
"""

Next, let’s define a simple Python function to construct lists of words from these poems.

import re

def clean_words (text):
    # replace every non-letter with a space, split, then drop the empty strings
    return filter(lambda x: len(x) > 0, re.sub("[^A-Za-z]", " ", text).split(" "))

jabber_words = clean_words(jabber_text.lower())
print jabber_words

parker_words = clean_words(parker_text.lower())
print parker_words

For some comparisons we’ll also need a list of the unique words in one of the poems.

jabber_uniq = sorted(set(jabber_words))
print jabber_uniq

Set Cardinality

Cardinality describes the number of distinct elements in a set.
Let's use HyperLogLog to implement a probabilistic counter, and estimate the number of distinct words in one of the poems.

The main idea here is that when you have a very large collection of things, counting becomes a problem.
In Python, the long integers have unlimited precision, so you can count really large sets as long as you have a lot of memory to use.
What if you don’t want to use up your application’s memory?
What if you need that memory for other purposes?
For example, imagine needing to compare the number of followers on Twitter, when the comparison is between an average user who has about 200 followers and Lady Gaga, who has more than 50 million followers.
Do you really care whether Lady Gaga has 50,000,000 or 50,000,100 followers?
The difference is noise anyway, so why waste lots of application memory?
What if there are many celebrities to count and compare?
Instead, approximate.

The HyperLogLog algorithm was first described in a paper by Flajolet in 2007.
This example uses a Python implementation by Vasily Evseenko.

Note that the code in this example is configured to accept a 1% counting error. We’ll compare the approximated count with the actual count to measure the observed error rate.

import hyperloglog

# accept 1% counting error
hll = hyperloglog.HyperLogLog(0.01)

for word in jabber_words:
    hll.add(word)

print "prob count %d, true count %d" % (len(hll), len(jabber_uniq))
print "observed error rate %0.2f" % (abs(len(hll) - len(jabber_uniq)) / float(len(jabber_uniq)))

Great: those results show that the observed error rate stays within the configured bound, as expected.

Set Membership

Set Membership determines whether a specified element is known to be within a given set.
In other words, once we define the set, we can run membership queries.
Let’s use a Bloom filter for approximate membership queries.

One interesting property of a BloomFilter is that false negatives are not possible.
So the main idea here is that we can load a set, then test elements to see whether they are included in the set.
A small number of elements that aren't in the set will test positive -- those are false positives, and their rate is bounded by a parameter.

Suppose you have a database of known customers for your web site, but the queries in the database are expensive.
If many people visit your web site, but only a fraction are customers, then you could use a BloomFilter of the known customers to test whether you even need to query the database.
That could reduce the cost of queries.
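
Here is a minimal sketch of that pattern, using the same pybloom BloomFilter class as the example that follows. The query_database() function and the customer list are hypothetical stand-ins for a real data store; only visitors that pass the filter trigger an actual query.

from pybloom import BloomFilter

# hypothetical stand-ins for a real customer table and an expensive query layer
known_customers = ["alice", "bob", "carol"]

def query_database (user_id):
    return user_id in known_customers

customer_filter = BloomFilter(capacity=100000, error_rate=0.001)

for user_id in known_customers:
    customer_filter.add(user_id)

for visitor in ["alice", "mallory", "bob", "trent"]:
    if visitor in customer_filter:
        # a small fraction of these are false positives, which the query then catches
        print visitor, query_database(visitor)
    else:
        # false negatives are impossible, so the query can be skipped entirely
        print visitor, False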

The BloomFilter was first described in a paper by Burton Bloom in 1970.
This example uses a Python implementation by Jay Baird.

In this case, we’ll load the words in Daylight Saving, then test the words in Jabberwocky.

from pybloom import BloomFilter

bf = BloomFilter(capacity=1000, error_rate=0.001)

for word in parker_words:
    bf.add(word)

intersect = set([])

for word in jabber_words:
    if word in bf:
        intersect.add(word)

print intersect

Eight words in common.
Similarly, this approach can be used in database queries, when one side of a JOIN is much smaller than the other.
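
For instance, here is a minimal sketch of a Bloom-filtered join over two made-up tables: build the filter from the small side, stream the large side through it, and run the exact check only on the rows that pass.

from pybloom import BloomFilter

# hypothetical tables: a small set of customer IDs and a larger list of orders
small_side = set(["c1", "c2", "c3"])
large_side = [("o1", "c2"), ("o2", "c9"), ("o3", "c3"), ("o4", "c7")]

join_filter = BloomFilter(capacity=1000, error_rate=0.001)

for key in small_side:
    join_filter.add(key)

# only rows that pass the filter need the exact (expensive) check
candidates = [row for row in large_side if row[1] in join_filter]
joined = [row for row in candidates if row[1] in small_side]
print joined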

Set Similarity

Thinking about the example above, what if you simply needed an estimate of the number of words in common between the two poems?
What if you have many sets and need quick comparisons of how similar they are?

For example, suppose you want to check a large number of documents for potential plagiarism.
Comparing the similarity of the documents would provide a quick estimate -- enough to weed out the documents that are clearly not similar to each other.

The MinHash algorithm was first described in a paper by Andrei Broder in 1997.
This example uses a Python implementation by Eric Zhu.

Here we’ll estimate the similarity between the words in the two poems.

from hashlib import sha1
from datasketch import MinHash

def mh_digest (data):
    m = MinHash(num_perm=512)

    for d in data:
        # fold each element's SHA-1 hash into the MinHash signature
        m.digest(sha1(d.encode('utf8')))

    return m

m1 = mh_digest(set(jabber_words))
m2 = mh_digest(set(parker_words))

print "Jaccard simularity %f" % m1.jaccard(m2), "estimated"

s1 = set(jabber_words)
s2 = set(parker_words)
actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))

print "Jaccard simularity %f" % actual_jaccard, "actual"

An estimate of 7.4% similarity versus a measure of 7.0% actual similarity, based on the Jaccard index.
In any case, we’re fairly certain that Dorothy Parker didn’t plagiarize Lewis Carroll.

Frequency Summaries

Frequency summaries describe how often events occur in a data stream.
Think: leaderboards used for online competitions, etc.

The main idea is that we want to measure and compare which events are occurring most frequently, and we probably don’t care about the events that occur less often.
The precision of the frequencies isn't particularly important.

We’ll use Count-min sketch for frequency summaries, to implement a probabilistic word count on one of the poems.

The Count-Min Sketch algorithm was first described in a paper by Graham Cormode and S. Muthukrishnan in 2005.
This example uses a Python implementation by Isaac Sijaranamual.

from collections import Counter
from yacms import CountMinSketch

counts = Counter()

# the sketch parameters trade memory for accuracy:
# a wider table and more hash functions mean tighter estimates
cms = CountMinSketch(200, 3)

for word in jabber_words:
    counts[word] += 1
    cms.update(word, 1)

Let’s take a look at the counts for common words…

for word in ["the", "and", "he", "that"]:
    print "instances of the word `%s` %d" % (word, cms.estimate(word))

One practical example would be language detection: comparing frequencies of the most commonly occurring words is a simple way to predict the language in which a document was written.
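
As a toy illustration of that idea, here is a minimal sketch that summarizes a handful of common words for two languages in separate Count-Min sketches, then scores new text against each. The word samples are tiny, hypothetical stand-ins for real corpora, and the whole thing reuses the clean_words() helper defined earlier.

from yacms import CountMinSketch

# hypothetical reference samples -- real language detection would use much more text
samples = {
    "english": "the and of to in that it is was he for on are as with".split(),
    "spanish": "de la que el en y a los del se las por un para con".split(),
}

sketches = {}

for lang, words in samples.items():
    cms_lang = CountMinSketch(200, 3)
    for w in words:
        cms_lang.update(w, 1)
    sketches[lang] = cms_lang

def guess_language (text):
    # score each language by the summed frequency estimates of the text's words
    scores = {}
    for lang, cms_lang in sketches.items():
        scores[lang] = sum(cms_lang.estimate(w) for w in clean_words(text.lower()))
    return max(scores, key=scores.get)

print guess_language("el gato se sienta en la silla")
print guess_language("the cat sat on the chair")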

Next, let’s look at where the estimates in the sketch differed from actual counts:

for e in counts:
    if counts[e] != cms.estimate(e):
        print "missed '%s' counter: %d, sketch: %d" % (e, counts[e], cms.estimate(e))

Okay, that was way better than writing 50 lines of Java code for a Hadoop application.
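
As a follow-on, here is a minimal sketch of the leaderboard use case mentioned earlier: a top-k summary over a stream. Since a Count-Min Sketch can estimate counts but cannot enumerate its keys, the sketch below keeps a small dictionary of current leaders alongside it; the choice of k and the simple eviction rule are assumptions for illustration.

from yacms import CountMinSketch

k = 5
cms_topk = CountMinSketch(200, 3)
leaders = {}  # word -> estimated count, kept only for the current leaders

for word in jabber_words:
    cms_topk.update(word, 1)
    leaders[word] = cms_topk.estimate(word)

    if len(leaders) > k:
        # evict the smallest leader so we never track more than k identities
        smallest = min(leaders, key=leaders.get)
        del leaders[smallest]

print sorted(leaders.items(), key=lambda kv: kv[1], reverse=True)

The sketch bounds the memory used for counting, while the candidate dictionary stays at a fixed, tiny size.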

Streaming Quantiles

Imagine that you have a large stream of data.
Perhaps you have an application that must measure that stream continually, where stopping to count simply isn’t an option.
Suppose you’re monitoring ecommerce transations, trying to detect credit card fraud?
Some measure become important (e.g., average order price).
Approximating metrics on a stream is a great way to build applications that are more robust and operate in real-time.

The t-Digest algorithm was first described by Ted Dunning and Otmar Ertl in 2013.
This example uses a Python implementation by Trademob GmbH.

In this case, we’ll skip the poems.
Instead let’s generate a stream of random numbers and look at its head, median, and tail.

from tdigest import TDigest
import random

td = TDigest()

for x in xrange(0, 1000):
    td.add(random.random(), 1)

for q in [0.05, 0.5, 0.95]:
    print "%f @ %f" % (q, td.quantile(q))

The pseudo-random number generator in Python provides something close to a uniform distribution.
In other words, it’s flat.

The Big Picture

These methods for approximation present a very different way of thinking about how to build applications and architectures for Big Data. For example, one immediate outcome is that we can query different time ranges of data by composing the hashes. Rather than run one SQL query for daily metrics and then another SQL query for weekly metrics, simply compose the hashed approximations for the seven days of the week. That implies much less data being stored, and significantly less computation required for reporting.
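
As a sketch of that composition, imagine one HyperLogLog counter of distinct visitor IDs per day. The merge below assumes the hyperloglog package exposes an update() method that combines counters by taking the register-wise maximum, which is how HyperLogLog union is normally defined; the visitor IDs are synthetic.

import hyperloglog

# hypothetical: one counter of distinct visitor IDs per day, with overlapping visitors
daily = [hyperloglog.HyperLogLog(0.01) for day in range(7)]

for day, hll_day in enumerate(daily):
    for visitor in xrange(day * 1000, day * 1000 + 1500):
        hll_day.add(str(visitor))

# weekly distinct count: merge the seven daily counters instead of re-scanning raw data
weekly = hyperloglog.HyperLogLog(0.01)

for hll_day in daily:
    weekly.update(hll_day)  # assumption: update() merges counters (register-wise max)

print "approx. distinct visitors for the week: %d" % len(weekly)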

In the big picture, the implications of these algorithms cut much deeper. Looking at the history of statistics over the past two centuries, much of the emphasis has been on defensibility. One would collect a data set, fit it to a known probability distribution, then make inferences from that model. The math was rigorous; it could generally hold up in court during expert testimony, or, say, during FDA drug trials for new pharmaceuticals.

However, that approach to working with data also led to our notion of a batch window. Capture data during one phase, fit to a model during another phase, then use the inferences in a third phase — with lots of time lag between phases.

About 15 years ago, the industry crossed a threshold, largely due to rapidly increasing data rates at early ecommerce firms and their automated use of machine learning with that data for their products: recommender systems, search engines, anti-fraud classifiers, etc. With these kinds of use cases, there was no time to have an expert make inferences based on data modeling techniques that would be defensible in court. Rather, the business case was to use data in near real-time, automating ML and inferences. In other words, predictability in lieu of defensibility. Batch windows became a bottleneck. Our concepts of applied mathematics had become a bottleneck. Leo Breiman captured much of that fundamental sea change in his 2001 paper, Statistical Modeling: The Two Cultures. I highly recommend reading it.

By using probabilistic data structures, we invert that bottleneck. Calculate error bounds in advance, probabilistically, so that well-formed approximations get built into data collection directly. Then we don’t have to stop to take a batch window, stop to fit a model, etc. Moreover, we can sample the results that we need, at any point within a real-time stream.

The results have huge implications for how we leverage system resources. Trade-off a few percentage points of error bounds, and reduce the required memory footprint by a couple orders of magnitude. What’s not to like about that?

A phrase used among data science teams at Twitter, attributed to Oscar Boykin, sums this up: "Hash, don't sample."

Selected Resources

Check out this O’Reilly webcast, Probabilistic Data Structures and Breaking Down Big Sequence Data by C. Titus Brown, about using approximation algorithms in genomics.

Two of the better introductions to the math involved here are:

For code in Scala, the Algebird library by Avi Bryant and Oscar Boykin provides an excellent framework for working with approximation algorithms.

BlinkDB is a project that leverages a variety of approximation techniques for accelerating SQL queries on data at scale, and has been incorporated into Spark SQL.

Also see Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman; associated with that, there's a highly recommended biennial conference, MMDS.
