Credit: Tim Peters
You have a sequence that may include duplicates, and you need to remove the duplicates in the fastest possible way without knowing much about the properties of the items in the sequence. You do not care about the order of items in the resulting sequence.
The key is to try several approaches, fastest first, and use try/except to handle the failing cases of the fastest approaches:
def unique(s):
    """ Return a list of the elements in s in arbitrary order, but
    without duplicates. """
    # Get the special case of an empty s out of the way very rapidly
    n = len(s)
    if n == 0:
        return []

    # Try using a dict first, because it's the fastest and will usually work
    u = {}
    try:
        for x in s:
            u[x] = 1
    except TypeError:
        del u  # Move on to the next method
    else:
        return list(u.keys())

    # Since you can't hash all elements, try sorting, to bring equal items
    # together and weed them out in a single pass
    try:
        t = list(s)
        t.sort()
    except TypeError:
        del t  # Move on to the next method
    else:
        assert n > 0
        last = t[0]
        lasti = i = 1
        while i < n:
            if t[i] != last:
                t[lasti] = last = t[i]
                lasti += 1
            i += 1
        return t[:lasti]

    # Brute force is all that's left
    u = []
    for x in s:
        if x not in u:
            u.append(x)
    return u
The purpose of this recipe's unique function is to take a sequence s as an argument and return a list of the items in s in arbitrary order, but without duplicates. For example, calling unique([1, 2, 3, 1, 2, 3]) returns an arbitrary permutation of [1, 2, 3], calling unique("abcabc") returns an arbitrary permutation of ["a", "b", "c"], and calling unique(([1, 2], [2, 3], [1, 2])) returns an arbitrary permutation of [[2, 3], [1, 2]].
The fastest way to remove duplicates from a sequence depends on some pretty subtle properties of the sequence elements, such as whether they're hashable and whether they support full comparisons. The unique function shown in this recipe tries three methods, from fastest to slowest, letting runtime exceptions pick the best method available for the sequence at hand.

For best speed, all sequence elements should be hashable. When they are, the unique function will usually work in linear time (i.e., O(N), or directly proportional to the number of elements in the input, which is a good and highly scalable performance characteristic).
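To make the dict-based step easier to study on its own, here is a minimal standalone sketch of just that method; the name unique_hashable is hypothetical, not part of the recipe, and the function assumes every element is hashable (it raises TypeError otherwise, which is exactly the signal the full recipe catches):

```python
def unique_hashable(s):
    """Dict-based dedup sketch: O(N), works only for hashable elements."""
    u = {}
    for x in s:
        u[x] = 1          # raises TypeError if x is unhashable
    return list(u.keys())
```

Because dict keys are unordered in the original recipe's sense, the result is some arbitrary permutation of the distinct elements.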
If it turns out that hashing the elements (using them as dictionary keys) is not possible, the next best thing is that the elements enjoy a total ordering. If list(s).sort() doesn't raise a TypeError, we can assume that s's elements do enjoy a total ordering. Then unique will usually work in O(N log N) time. Note that Python lists' sort method was specially designed to be highly efficient in the presence of many duplicate elements, so the sorting approach may be more effective in Python than elsewhere.
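Here, likewise, is a standalone sketch of just the sorting method, under the hypothetical name unique_sortable; it assumes the elements support full comparisons (e.g., lists of lists, which are unhashable but sortable). After sorting, equal items are adjacent, so one linear pass compacts them in place:

```python
def unique_sortable(s):
    """Sort-based dedup sketch: O(N log N), needs a total ordering."""
    t = list(s)
    if not t:
        return []
    t.sort()              # raises TypeError if elements are unorderable
    n = len(t)
    last = t[0]
    lasti = i = 1
    while i < n:
        if t[i] != last:
            t[lasti] = last = t[i]  # keep first copy of each run
            lasti += 1
        i += 1
    return t[:lasti]
```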
If sorting also turns out to be impossible, the sequence elements must at least support equality testing, or else the very concept of duplicates can't really be meaningful for them. In this case, unique works in quadratic time (i.e., O(N²), or proportional to the square of the number of elements in the input), which is not very scalable, but it is the least of all evils, given the sequence items' obviously peculiar nature if we get all the way to this subcase.
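For completeness, the brute-force step in isolation looks like this (again under a hypothetical name); it needs nothing beyond equality testing, and as a side effect it is the only one of the three methods that happens to preserve first-occurrence order:

```python
def unique_equality_only(s):
    """Brute-force dedup sketch: O(N**2), needs only == on elements."""
    u = []
    for x in s:
        if x not in u:    # linear scan of u makes the whole thing quadratic
            u.append(x)
    return u
```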
This is a pure example of how algorithm efficiency depends on the strength of the assumptions you can make about the data. Of course, you could split this into three distinct functions and directly call the one that best meets your needs. In practice, however, the brute-force method is so slow for large sequences that nothing measurable is lost by simply letting the function as written try the faster methods first.
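The TypeError-driven dispatch can be seen directly with two small probes (the helper names here are illustrative, not part of the recipe): lists are unhashable, which knocks out the dict method, and complex numbers are unorderable, which knocks out the sorting method.

```python
def is_hashable(x):
    """Would x survive the dict-based method?"""
    try:
        hash(x)
        return True
    except TypeError:
        return False

def is_sortable(seq):
    """Would seq survive the sort-based method?"""
    try:
        sorted(seq)
        return True
    except TypeError:
        return False
```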
If you need to preserve the same order of items in the output sequence as in the input sequence, see Recipe 17.5.