Credit: Xavier Defrang
Sometimes regular expressions afford the fastest solution even in
cases where their applicability is anything but obvious. In
particular, the sub
method of re
objects
makes regular expressions a good way to perform string substitutions.
Here is how you can produce a result string from an input string
where each occurrence of any key in a given dictionary is replaced by
the corresponding value in the dictionary:
# requires Python 2.1 or later
from _ _future_ _ import nested_scopes
import re
# the simplest, lambda-based implementation
def multiple_replace(adict, text):
# Create a regular expression from all of the dictionary keys
regex = re.compile("|".join(map(re.escape, adict.keys( ))))
# For each match, look up the corresponding value in the dictionary
return regex.sub(lambda match: adict[match.group(0)], text)
A more powerful and flexible
approach is to wrap the dictionary into a callable object that
directly supports the lookup and replacement idea, which you can use
directly as the callback in the sub
method. This
object-oriented approach is more flexible because the callable object
can keep its own state and therefore is easily extensible to other
tasks. In Python 2.2 and later, you can create a class for this
object by extending the dict
built-in type, while
in older Python versions you must fall back on
UserDict.UserDict
(built-in types were not
subclassable in older versions). A
try
/except
lets us easily write
code that works optimally on both old and new versions of Python:
try: dict except: from UserDict import UserDict as dict class Xlator(dict): """ All-in-one multiple-string-substitution class """ def _make_regex(self): """ Build re object based on the keys of the current dictionary """ return re.compile("|".join(map(re.escape, self.keys( )))) def _ _call_ _(self, match): """ Handler invoked for each regex match """ return self[match.group(0)] def xlat(self, text): """ Translate text, returns the modified text. """ return self._make_regex( ).sub(self, text)
This
recipe shows how to use the Python standard re
module to perform single-pass multiple-string substitution using a
dictionary. Let’s say you have a dictionary-based,
one-to-one mapping between strings. The keys are the set of strings
(or regular-expression patterns) you want to replace, and the
corresponding values are the strings with which to replace them. You
can perform the substitution by calling re.sub
for
each key/value pair in the dictionary, thus processing and creating a
new copy of the whole text several times, but it is clearly better to
do all the changes in a single pass, processing and creating a copy
of the text only once. Fortunately,
re.sub
’s callback facility makes
this better approach quite easy.
First, we have to build a regular expression from the set of keys we
want to match. Such a regular expression is a pattern of the form
“a1|a2|...|an"
and can easily be generated using a one-liner, as shown in the
recipe. Then, instead of giving re.sub
a
replacement string, we call it with a callback argument.
re.sub
calls this object for each match, with a
re.MatchObject
as its only argument, and expects the
replacement string as the call’s result. In our
case, the callback just has to look up the matched text in the
dictionary and return the corresponding value.
The recipe has two implementations: one is
lambda
-based, and the other uses a callable,
dictionary-like object. The second option is better if you want to
perform additional processing on each match (e.g., build a histogram
of the number of times each possible substitution is actually
performed) or if you just dislike lambda
. Another
potential advantage of the class-based approach is performance. If
you know that the translation dictionary is static, and you must
apply the same translation to several input strings, you can move the
_make_regex
call from the xlat
method, where it’s currently done, to an _ _init_ _
method, to avoid repeatedly preparing and
compiling the regular expression.
Here’s a usage example for each half of this recipe.
We would normally have it as a part of the same
.py
source file as the function and class shown
in the recipe, so it is guarded by the traditional Python idiom that
runs it if and only if the module is called as a main script:
if _ _name_ _ == "_ _main_ _": text = "Larry Wall is the creator of Perl" adict = { "Larry Wall" : "Guido van Rossum", "creator" : "Benevolent Dictator for Life", "Perl" : "Python", } print multiple_replace(adict, text) xlat = Xlator(adict) print xlat.xlat(text)
Substitutions such as those performed by this recipe are often
intended to operate on entire words, rather than on arbitrary
substrings. Regular expressions are good at picking up the beginnings
and endings of words, thanks to the special sequence
r'\b'
. Thus, we can easily make a version of the
Xlator
class that is constrained to substitute
only entire words:
class WordXlator(Xlator): """ An Xlator version to substitute only entire words """ def _make_regex(self): return re.compile( r'\b'+r'\b|\b'.join(map(re.escape, self.keys( )))+r'\b')
Note how much easier it is to customize Xlator
than it would be to customize the multiple_replace
function. Ease of customization by subclassing and overriding helps
you avoid copy-and-paste coding, and this is another excellent reason
to prefer object-oriented structures over simpler procedural ones. Of
course, just because some functionality is packaged up as a class
doesn’t magically make it customizable in just the
way you want. Customizability also takes some foresight in dividing
the functionality into separately overridable methods that correspond
to the right pieces of overall functionality. Fortunately, you
don’t have to get it right the first time; when code
does not have the optimal internal structure for the task at hand
(e.g., reuse by subclassing and selective overriding), you can and
should refactor the code so that its internal structure serves your
needs. Just make sure you have a suitable battery of tests ready to
run to ensure that your refactoring hasn’t broken
anything, and then you can refactor to your heart’s
content. See http://www.refactoring.com for more
information on the important art and practice of refactoring.
Documentation for the re
module in the
Library Reference; the Refactoring home page
(http://www.refactoring.com).
Get Python Cookbook now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.