Credit: Iuri Wickert, Duncan Grisby, Steve Holden, Alex Martelli
You need to consume, in random order, the items of a rather long list, and the most direct approach is painfully slow.
While it’s a common mistake to be overly concerned with speed, you should not ignore the different performances of various algorithms. Suppose we must process all of the items in a long list in random order, without repetition. For example:
import random

# an example of a processing function
def process(datum):
    print datum

# an example of a set of data to process
data = range(10000)
The simplest version is very slow:
def simple( ):
    while data:
        # returns element, not index (doesn't help)
        elem = random.choice(data)
        # needs to search for the element throughout the list!
        data.remove(elem)
        process(elem)
Here is a faster version:
def faster( ):
    while data:
        index = random.randrange(len(data))
        elem = data[index]
        # direct deletion, no search needed
        del data[index]
        process(elem)

# the same, but preserving the data list
def faster_preserve( ):
    aux = range(len(data))
    while aux:
        posit = random.randrange(len(aux))
        index = aux[posit]
        elem = data[index]
        # alters the auxiliary list only
        del aux[posit]
        process(elem)
However, the key improvement is to switch to an O(N) algorithm:
def improved( ):
    size = len(data)
    while size:
        index = random.randrange(size)
        elem = data[index]
        data[index] = data[size-1]
        size = size - 1
        process(elem)
Of course, you can also implement a version of this that preserves the data list.
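For instance, a preserving variant might look like the following sketch (the name improved_preserve is illustrative, not part of the recipe): it draws random positions from an auxiliary list of indexes and overwrites each chosen position with the last index still in play, so data itself is never modified:

def improved_preserve( ):
    # sketch of a preserving O(N) variant; the name is illustrative
    aux = range(len(data))
    size = len(aux)
    while size:
        posit = random.randrange(size)
        index = aux[posit]
        elem = data[index]
        # overwrite the chosen slot with the last live index, then shrink
        aux[posit] = aux[size-1]
        size = size - 1
        process(elem)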
But the winner is the version that appears to be the simplest:
def best( ):
    random.shuffle(data)
    for elem in data:
        process(elem)

# or, if you need to preserve the data list's original ordering:
def best_preserve( ):
    aux = list(data)
    random.shuffle(aux)
    for elem in aux:
        process(elem)
The simplest, most direct way of consuming a list in a random fashion is painfully slow for lists with a few hundred elements. While it is tempting to use the simple, clear choice/remove combination, as in the simple function, this is a bad choice, because remove must linearly search through the list to find the element to delete. In other words, the overall performance is O(N²), with a large multiplicative constant.
Fortunately, there are equally simple (or simpler), but much faster, ways to do this. The faster function, using randrange/del to generate the random indexes for the list, can skip the costly search for the deletion. If it's important to preserve the input list, you can use a disposable auxiliary list to generate the data list indexes, as in the faster_preserve function.
However, del anylist[x] for a random x is still O(N), so overall performance is still O(N²), albeit with a much smaller multiplicative constant. Incidentally, the pseudorandom order in which items are processed is not the same with the various approaches, even if random is seeded in the same way. All of the orderings are equally pseudorandom, though.
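You can see this for yourself by seeding the random module identically before each run and recording the processing order. The little harness below is only a sketch (show_order and its default size are illustrative, and it assumes it lives in the same module as the recipe's functions and globals):

def show_order(consumer, size=10):
    # rebind the recipe's module globals so the consumer works on a
    # tiny, freshly built data set and records what it processes
    global data, process
    seen = []
    data = range(size)
    process = seen.append
    random.seed(12345)          # identical seed for every consumer
    consumer()
    print consumer.__name__, seen

show_order(faster)
show_order(best)    # same seed, yet a different (but reproducible) order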
Pursuing O(N) performance, one possibility is not to delete an item from a list at all, but rather to overwrite it with the last item and decrease, at each step, the number of items from which we're choosing. The improved function takes this tack and benefits greatly from it, in terms of performance.
The fastest approach, however, is to shuffle the data (or an auxiliary copy of it) once, at the start, then just process the items sequentially in the resulting pseudorandom order. The nice thing about this approach, shown in the best and best_preserve functions, is that it's actually the simplest of all.
On lists of 10,000 items, as shown in this recipe, the overhead (meaning pure overhead, using a do-nothing processing function) of simple is about 13 or 14 times more than that of faster and faster_preserve. Those functions, in turn, have over twice as much overhead as improved, best, and best_preserve. On lists of 100,000 items, faster and faster_preserve become about 15 times slower than improved, best, and best_preserve. The latter two have, for every list size, about 20%-30% less overhead than improved, a very minor issue, although their utter simplicity clearly does make them deserve their names.
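If you want to reproduce rough measurements of this kind, a throwaway harness along the following lines will do. It is only a sketch (time_consumer and do_nothing are illustrative names, and the numbers you get will depend on your machine and Python version):

import time

def do_nothing(datum):
    pass

def time_consumer(consumer, size=10000):
    # rebuild the global data list and plug in a do-nothing processor,
    # since most of the versions consume or reorder data as they run
    global data, process
    data = range(size)
    process = do_nothing
    start = time.clock()
    consumer()
    return time.clock() - start

for consumer in (simple, faster, faster_preserve, improved,
                 best, best_preserve):
    print '%-16s %.3f' % (consumer.__name__, time_consumer(consumer))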
While an improvement of 25%, or even a factor of 2, may be neglected without substantial cost to the performance of your program as a whole, the same does not apply to an algorithm that is 10 or more times as slow as it could be. Such terrible performance is likely to make that program fragment a bottleneck all by itself, particularly when we're talking about O(N²) versus O(N) behavior.