Credit: John Jensen, Fred Bremmer
You need to produce ascending- or descending-count histograms, such as the most or least common words in a file, popular pages on a web site, etc.
Histogramming is basically an issue of counting item occurrences (a Python dictionary makes this quite easy) and sorting by the counts. In Python, the two actions, and the dictionary that holds the counts, are easily wrapped into a class:
class Counter: def _ _init_ _(self): self.dict = {} def add(self, item): count = self.dict.get(item, 0) self.dict[item] = count + 1 def counts(self, desc=None): """ Returns list of keys sorted by values. Pass desc as 1 if you want a descending sort. """ result = map(None, self.dict.values(), self.dict.keys( )) result.sort( ) if desc: result.reverse( ) return result
The
add
method shows the normal Python idiom for counting occurrences of
arbitrary (but hashable) items, using a dictionary to hold the
counts. The counts
method is where all the action is. It
takes the dictionary and produces an ascending or descending sort of
keys by values, returning a list of pairs representing the desired
histogram. The
map
call takes advantage of an interesting but little-known tidbit of
documented Python behavior. While the values
and
keys
methods of a dictionary return their results
in an arbitrary order, the ordering is compatible when the two
methods are called without any intervening modification to the
dictionary object. In other words, d[d.keys( )[x]]
is d.values(x)
for any valid index
x
. This lets us elegantly zip values and keys with
the value as the first item and the key as the second item in each
pair, so the
sort
method will work right (by using map
with a first
argument of None
rather than
zip
, we keep compatibility with 1.5.2).
Here is an example:
sentence = "Hello there this is a test. Hello there this was a test, " \ "but now it is not." words = sentence.split( ) c = Counter( ) for word in words: c.add(word) print "Ascending count:" print c.counts( ) print "Descending count:" print c.counts(1)
This produces:
Ascending count: [(1, 'but'), (1, 'it'), (1, 'not.'), (1, 'now'), (1, 'test,'), (1, 'test.'), (1, 'was'), (2, 'Hello'), (2, 'a'), (2, 'is'), (2, 'there'), (2, 'this')] Descending count: [(2, 'this'), (2, 'there'), (2, 'is'), (2, 'a'), (2, 'Hello'), (1, 'was'), (1, 'test.'), (1, 'test,'), (1, 'now'), (1, 'not.'), (1, 'it'), (1, 'but')]
If you give up on 1.5.2 compatibility and use a list comprehension
instead of the map
call, the code arguably becomes
a little easier to read:
def counts(self, desc=None):
result = [(val, key) for key, val in self.dict.items( )]
result.sort( )
if desc: result.reverse( )
return result
However, if this issue ever arises in a spot of your program that is
a critical speed bottleneck, you should measure performance
accurately for each version of counts
. Often (but
not always), map
displays surprisingly good
performance characteristics when compared to list comprehensions (at
least when no lambda
is involved in the use of
map
).
Get Python Cookbook now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.