Filtering a String for a Set of Characters

Credit: Jürgen Hermann, Nick Perkins

Problem

Given a set of characters to keep, you need to build a filtering functor (a function-like, callable object). The specific functor you need to build is one that, applied to any string s, returns a copy of s that contains only characters in the set.

Solution

The string.maketrans function and translate method of string objects are fast and handy for all tasks of this ilk:

import string

# Make a reusable string of all characters
_allchars = string.maketrans('', '')

def makefilter(keep):
    """ Return a functor that takes a string and returns a partial copy of that
        string consisting of only the characters in 'keep'.
    """
    # Make a string of all characters that are not in 'keep'
    delchars = _allchars.translate(_allchars, keep)

    # Return the functor, binding the two strings as default args
    return lambda s, a=_allchars, d=delchars: s.translate(a, d)

def canonicform(keep):
    """ Given a string, considered as a set of characters, return the
        string's characters as a canonic-form string: alphabetized
        and without duplicates.
    """
    return makefilter(keep)(_allchars)

if _ _name_ _ == '_ _main_ _':
    identifier = makefilter(string.letters + string.digits + '_')
    print identifier(_allchars)

Discussion

The key to understanding this recipe lies in the definitions of the translate and maketrans functions in the string module. translate takes a string and replaces each character in it with the corresponding character in the translation table passed in as the second argument, deleting the characters specified in the third argument. maketrans is a utility routine that helps create the translation tables.

Efficiency is vastly improved by splitting the filtering task into preparation and execution phases. The string of all characters is clearly reusable, so we build it once and for all when this module is imported. That way, we ensure that each filtering functor has a reference to the same string of all characters, not wasting any memory. The string of characters to delete depends on the set of characters to keep, so we build it in the makefilter factory function. This is done quite rapidly using the translate method to delete the characters to keep from the string of all characters. The translate method is very fast, as are the construction and execution of these useful little functors. The solution also supplies an extremely simple function to put any set of characters, originally an arbitrary string, into canonic-string form (alphabetically sorted, without duplicates). The same trick encapsulated in the canonicform function is also explicitly used in the test code that is executed when this runs as a script.

Of course, you don’t have to use lambda (here or anywhere else). A named function local to the factory function will do just as well. In other words, this recipe works fine if you change makefilter’s return statement into the following two statements:

def filter(s, a=_allchars, d=delchars): return s.translate(a, d)
return filter

Many Pythonistas would consider this clearer and more readable.

This isn’t a big issue, but remember that lambda is never necessary. In any case in which you find yourself straining to fit code into a lambda’s limitations (i.e., just an expression, with no statements allowed), you can and should always use a local named function instead, to avoid all the limitations and problems.

With Python 2.2, or Python 2.1 and a from _ _future_ _ import nested_scopes, you get lexically nested scopes, so that if you want to, you can avoid binding _allchars and delchars as default values for arguments in the returned functor. However, it is (marginally) faster to use this binding anyway: local variables are the fastest kind to access, and arguments are nothing but prebound local variables. Globals and names from nested scopes require a little more effort from the interpreter (and sometimes, perhaps more significantly, from a human being who is reading the code). This is why we bind _allchars as argument a here despite the fact that, in any release of Python, we could have just accessed it as a global variable.

See Also

Documentation for the maketrans function in the string module in the Library Reference.

Get Python Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.