Learning Python


By Mark Lutz & David Ascher
1st Edition April 1999
1-56592-464-9, Order Number: 4649
384 pages, $34.95

Sample Chapter 9: Common Tasks in Python


In this chapter:
Data Structure Manipulations
Manipulating Files
Manipulating Programs
Internet-Related Activities
Bigger Examples
Exercises

At this point, we have covered the syntax of Python, its basic data types, and many of our favorite functions in the Python library. This chapter assumes that all the basic components of the language are at least understood and presents some ways in which Python is, in addition to being elegant and "cool," just plain useful. We present a variety of tasks common to Python programmers. These tasks are grouped by categories--data structure manipulations, file manipulations, etc.

Data Structure Manipulations

One of Python's greatest features is that it provides the list, tuple, and dictionary built-in types. They are so flexible and easy to use that once you've grown used to them, you'll find yourself reaching for them automatically.

Making Copies Inline

Due to Python's reference management scheme, the statement a = b doesn't make a copy of the object referenced by b; instead, it makes a new reference to that object. Sometimes a new copy of an object, not just a shared reference, is needed. How to do this depends on the type of the object in question. The simplest way to make copies of lists and tuples is somewhat odd. If myList is a list, then to make a copy of it, you can do:


newList = myList[:]

which you can read as "slice from beginning to end," since you'll remember from Chapter 2, Types and Operators, that the default index for the start of a slice is the beginning of the sequence (0), and the default index for the end of a slice is the end of the sequence. Since tuples support the same slicing operation as lists, this same technique can also copy tuples. Dictionaries, on the other hand, don't support slicing. To make a copy of a dictionary myDict, you can use:


newDict = {}
for key in myDict.keys():
    newDict[key] = myDict[key]

This is such a common task that a new method was added to the dictionary object in Python 1.5, the copy() method, which performs this task. So the preceding code can be replaced with the single statement:

newDict = myDict.copy()
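As a quick check that copy() gives you an independent dictionary, consider this small sketch (the dictionary contents here are made up for the example):

```python
# copy() makes a new dictionary; changes to the copy don't
# show up in the original (note that this is a shallow copy).
myDict = {'name': 'Willie'}
newDict = myDict.copy()
newDict['city'] = 'Providence, RI'
assert myDict == {'name': 'Willie'}                            # original unchanged
assert newDict == {'name': 'Willie', 'city': 'Providence, RI'}
```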

Another common dictionary operation is also now a standard dictionary feature. If you have a dictionary oneDict, and want to update it with the contents of a different dictionary otherDict, simply type oneDict.update(otherDict). This is the equivalent of:


for key in otherDict.keys():
    oneDict[key] = otherDict[key]

If oneDict shared some keys with otherDict before the update() operation, the old values associated with the keys in oneDict are obliterated by the update. This may be what you want to do (it usually is, which is why this behavior was chosen and why it was called "update"). If it isn't, the right thing to do might be to complain (raise an exception), as in:


def mergeWithoutOverlap(oneDict, otherDict):
    newDict = oneDict.copy()
    for key in otherDict.keys():
        if key in oneDict.keys():
            raise ValueError, "the two dictionaries are sharing keys!"
        newDict[key] = otherDict[key]
    return newDict

or, alternatively, combine the values of the two dictionaries, with a tuple, for example:


def mergeWithOverlap(oneDict, otherDict):
    newDict = oneDict.copy()
    for key in otherDict.keys():
        if key in oneDict.keys():
            newDict[key] = oneDict[key], otherDict[key]
        else:
            newDict[key] = otherDict[key]
    return newDict

To illustrate the differences between the preceding three algorithms, consider the following two dictionaries:


phoneBook1 = {'michael': '555-1212', 'mark': '554-1121', 'emily': '556-0091'}
phoneBook2 = {'latoya': '555-1255', 'emily': '667-1234'}

If phoneBook1 is possibly out of date, and phoneBook2 is more up to date but less complete, the right usage is probably phoneBook1.update(phoneBook2). If the two phoneBooks are supposed to have nonoverlapping sets of keys, using newBook = mergeWithoutOverlap(phoneBook1, phoneBook2) lets you know if that assumption is wrong. Finally, if one is a set of home phone numbers and the other a set of office phone numbers, chances are newBook = mergeWithOverlap(phoneBook1, phoneBook2) is what you want, as long as the subsequent code that uses newBook can deal with the fact that newBook['emily'] is the tuple ('556-0091', '667-1234').
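Here's a sketch that runs two of these merge strategies side by side on the phone books (mergeWithOverlap is repeated from above so the snippet stands alone):

```python
phoneBook1 = {'michael': '555-1212', 'mark': '554-1121', 'emily': '556-0091'}
phoneBook2 = {'latoya': '555-1255', 'emily': '667-1234'}

def mergeWithOverlap(oneDict, otherDict):
    newDict = oneDict.copy()
    for key in otherDict.keys():
        if key in oneDict.keys():
            newDict[key] = oneDict[key], otherDict[key]    # keep both values
        else:
            newDict[key] = otherDict[key]
    return newDict

# update(): on shared keys, the updating dictionary wins
updated = phoneBook1.copy()
updated.update(phoneBook2)
assert updated['emily'] == '667-1234'

# mergeWithOverlap(): shared keys get a tuple of both values
merged = mergeWithOverlap(phoneBook1, phoneBook2)
assert merged['emily'] == ('556-0091', '667-1234')
assert merged['latoya'] == '555-1255'
```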

Making Copies: The copy Module

Back to making copies: the [:] and .copy() tricks will get you copies in 90% of the cases. If you are writing functions that, in true Python spirit, can deal with arguments of any type, it's sometimes necessary to make copies of X, regardless of what X is. In comes the copy module. It provides two functions, copy and deepcopy. The first is just like the [:] sequence slice operation or the copy method of dictionaries. The second is more subtle and has to do with deeply nested structures (hence the term deepcopy). Take the example of copying a list listOne by slicing it from beginning to end using the [:] construct. This technique makes a new list that contains references to the same objects contained in the original list. If the contents of that original list are immutable objects, such as numbers or strings, the copy is as good as a "true" copy. However, suppose that the first element in listOne is itself a dictionary (or any other mutable object). The first element of the copy of listOne is a new reference to the same dictionary. So if you then modify that dictionary, the modification is evident in both listOne and the copy of listOne. An example makes it much clearer:


>>> import copy
>>> listOne = [{"name": "Willie", "city": "Providence, RI"}, 1, "tomato", 3.0]
>>> listTwo = listOne[:]                   # or listTwo=copy.copy(listOne)
>>> listThree = copy.deepcopy(listOne)
>>> listOne.append("kid")
>>> listOne[0]["city"] = "San Francisco, CA"
>>> print listOne, listTwo, listThree
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0, 'kid']
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0]
[{'name': 'Willie', 'city': 'Providence, RI'}, 1, 'tomato', 3.0]

As you can see, modifying listOne directly modified only listOne. Modifying the first entry of the list referenced by listOne led to changes in listTwo, but not in listThree; that's the difference between a shallow copy ([:]) and a deepcopy. The copy module functions know how to copy all the built-in types that are reasonably copyable,[1] including classes and instances.

Sorting and Randomizing

In Chapter 2, you saw that lists have a sort method that does an in-place sort. Sometimes you want to iterate over the sorted contents of a list, without disturbing the contents of this list. Or you may want to list the sorted contents of a tuple. Because tuples are immutable, an operation such as sort, which modifies the sequence in place, is not allowed. The only solution is to make a list copy of the elements, sort the list copy, and work with the sorted copy, as in:


listCopy = list(myTuple)
listCopy.sort()
for item in listCopy:
    print item                             # or whatever needs doing

This solution is also the way to deal with data structures that have no inherent order, such as dictionaries. One of the reasons that dictionaries are so fast is that the implementation reserves the right to change the order of the keys in the dictionary. It's really not a problem, however, given that you can iterate over the keys of a dictionary using an intermediate copy of the keys of the dictionary:


keys = myDict.keys()                       # returns an unsorted list of
                                           # the keys in the dict
keys.sort()
for key in keys:                           # print key, value pairs 
    print key, myDict[key]                 # sorted by key

The sort method on lists uses the standard Python comparison scheme. Sometimes, however, that scheme isn't what's needed, and you need to sort according to some other procedure. For example, when sorting a list of words, case (lower versus UPPER) may not be significant. The standard comparison of text strings, however, says that all uppercase letters "come before" all lowercase letters, so 'Baby' is "less than" 'apple' but 'baby' is "greater than" 'apple'. In order to do a case-independent sort, you need to define a comparison function that takes two arguments, and returns -1, 0, or 1 depending on whether the first argument is smaller than, equal to, or greater than the second argument. So, for our case-independent sorting, you can use:


>>> def caseIndependentSort(something, other):
...    something, other  = string.lower(something), string.lower(other)
...    return cmp(something, other)
... 
>>> testList = ['this', 'is', 'A', 'sorted', 'List']
>>> testList.sort()
>>> print testList
['A', 'List', 'is', 'sorted', 'this']
>>> testList.sort(caseIndependentSort)
>>> print testList
['A', 'is', 'List', 'sorted', 'this']

We're using the built-in function cmp, which does the hard part of figuring out that 'a' comes before 'b', 'b' before 'c', etc. Our sort function simply lowercases both items and sorts the lowercased versions, which is one way of making the comparison case-independent. Also note that the lowercasing conversion is local to the comparison function, so the elements in the list aren't modified by the sort.

Randomizing: The random Module

What about randomizing a sequence, such as a list of lines? The easiest way to randomize a sequence is to repeatedly use the choice function in the random module, which returns a random element from the sequence it receives as an argument.[2] In order to avoid getting the same line multiple times, remember to remove the chosen item. When manipulating a list object, use the remove method:


import random

while myList:                        # will stop looping when myList is empty
    element = random.choice(myList)
    myList.remove(element)
    print element,

If you need to randomize a nonlist object, it's usually easiest to convert that object to a list and randomize the list version of the same data, rather than come up with a new strategy for each data type. This might seem a wasteful strategy, given that it involves building intermediate lists that might be quite large. In general, however, what seems large to you probably won't seem so to the computer, thanks to the reference system. Also, consider the time saved by not having to come up with a different strategy for each data type! Python is designed to save time; if that means running a slightly slower or bigger program, so be it. If you're handling enormous amounts of data, it may be worthwhile to optimize. But never optimize until the need for optimization is clear; that would be a waste of time.
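As a minimal sketch of this convert-to-a-list strategy (the string used here is made up for the example), here's how to randomize the characters of a string, a nonlist sequence, using the same choice/remove idiom:

```python
import random

text = "abcde"
chars = list(text)              # convert the nonlist object to a list
shuffled = []
while chars:                    # same idiom as above: pick, remove, collect
    ch = random.choice(chars)
    chars.remove(ch)
    shuffled.append(ch)

# same characters survive, in a (possibly) new order
check = shuffled[:]
check.sort()
assert check == ['a', 'b', 'c', 'd', 'e']
```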

Making New Data Structures

The last point about not reinventing the wheel is especially true when it comes to data structures. For example, Python lists and dictionaries might not be the lists and dictionaries or mappings you're used to, but you should avoid designing your own data structure if these structures will suffice. The algorithms they use have been tested under wide ranges of conditions, and they're fast and stable. Sometimes, however, the interface to these algorithms isn't convenient for a particular task.

For example, computer-science textbooks often describe algorithms in terms of other data structures such as queues and stacks. To use these algorithms, it may make sense to come up with a data structure that has the same methods as these data structures (such as pop and push for stacks or enqueue/dequeue for queues). However, it also makes sense to reuse the built-in list type in the implementation of a stack. In other words, you need something that acts like a stack but is based on a list. The easiest solution is to use a class wrapper around a list. For a minimal stack implementation, you can do this:


class Stack:
    def __init__(self, data):
        self._data = list(data)
    def push(self, item):
        self._data.append(item)
    def pop(self):
        item = self._data[-1]
        del self._data[-1]
        return item

The following is simple to write, to understand, to read, and to use:


>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
>>> thingsToDo.push('do the dishes')
>>> print thingsToDo.pop()
do the dishes
>>> print thingsToDo.pop()
wash the kid

Two standard Python naming conventions are used in the Stack class above. The first is that class names start with an uppercase letter, to distinguish them from functions. The other is that the _data attribute starts with an underscore. This is a half-way point between public attributes (which don't start with an underscore), private attributes (which start with two underscores; see Chapter 6, Classes), and Python-reserved identifiers (which both start and end with two underscores). What it means is that _data is an attribute of the class that shouldn't be needed by clients of the class. The class designer expects such "pseudo-private" attributes to be used only by the class methods and by the methods of any eventual subclass.

Making New Lists and Dictionaries: The UserList and UserDict Modules

The Stack class presented earlier does its minimal job just fine. It assumes a fairly minimal definition of what a stack is, specifically, something that supports just two operations, a push and a pop. Quickly, however, you find that some of the features of lists are really nice, such as the ability to iterate over all the elements using the for...in... construct. This can be done by reusing existing code. In this case, you should use the UserList class defined in the UserList module as a class from which the Stack can be derived. The library also includes a UserDict module that is a class wrapper around a dictionary. In general, they are there to be specialized by subclassing. In our case:


# import the UserList class from the UserList module
from UserList import UserList
 
# subclass the UserList class
class Stack(UserList):
    push = UserList.append
    def pop(self):
        item = self[-1]                    # uses __getitem__
        del self[-1]
        return item

This Stack is a subclass of the UserList class. The UserList class implements the behavior of the [] brackets by defining the special __getitem__ and __delitem__ methods among others, which is why the code in pop works. You don't need to define your own __init__ method because UserList defines a perfectly good default. Finally, the push method is defined just by saying that it's the same as UserList's append method. Now we can do list-like things as well as stack-like things:


>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
>>> print thingsToDo                  # inherited from UserList
['write to mom', 'invite friend over', 'wash the kid']
>>> thingsToDo.pop()        
'wash the kid'
>>> thingsToDo.push('change the oil')
>>> for chore in thingsToDo:          # we can also iterate over the contents
...    print chore                    # as "for .. in .." uses __getitem__
...
write to mom
invite friend over
change the oil

 

NOTE: As this book was being written, Guido van Rossum announced that in Python 1.5.2 (and subsequent versions), list objects now have an additional method called pop, which behaves just like the one here. It also has an optional argument that specifies what index to use to do the pop (with the default being the last element in the list).
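To illustrate the pop method described in this note, here is a minimal sketch (the list contents are made up for the example):

```python
stack = ['a', 'b', 'c']
assert stack.pop() == 'c'     # no argument: remove and return the last element
assert stack.pop(0) == 'a'    # optional index: pop from the front instead
assert stack == ['b']
```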

Manipulating Files

Scripting languages were designed in part in order to help people do repetitive tasks quickly and simply. One of the common things webmasters, system administrators, and programmers need to do is to take a set of files, select a subset of those files, do some sort of manipulation on this subset, and write the output to one or a set of output files. (For example, in each file in a directory, find the last word of every other line that starts with something other than the # character, and print it along with the name of the file.) This is a task for which special-purpose tools have been developed, such as sed and awk. We find that Python does the job just fine using very simple tools.

Doing Something to Each Line in a File

The sys module is most helpful when it comes to dealing with an input file, parsing the text it contains and processing it. Among its attributes are three file objects, called sys.stdin, sys.stdout, and sys.stderr. The names come from the notion of the three streams, called standard in, standard out, and standard error, which are used to connect command line tools. Standard output (stdout) is used by every print statement. It's a file object with all the output methods of file objects opened in write mode, such as write and writelines. The other often-used stream is standard in (stdin), which is also a file object, but with the input methods, such as read, readline, and readlines. For example, the following script counts all the lines in the file that is "piped in":


import sys
data = sys.stdin.readlines()
print "Counted", len(data), "lines."

On Unix, you could test it by doing something like:


% cat countlines.py | python countlines.py 
Counted 3 lines.

On Windows or DOS, you'd do:


C:\> type countlines.py | python countlines.py 
Counted 3 lines.

The readlines function is useful when implementing simple filter operations. Here are a few examples of such filter operations:

Finding all lines that start with a #

import sys
for line in sys.stdin.readlines():
    if line[0] == '#':
        print line,
Note that a final comma is needed after the print statement because the line string already includes a newline character as its last character.
Extracting the fourth column of a file (where columns are defined by whitespace)

import sys, string
for line in sys.stdin.readlines():
    words = string.split(line) 
    if len(words) >= 4:
        print words[3]
We look at the length of the words list to find if there are indeed at least four words. The last two lines could also be replaced by the try/except idiom, which is quite common in Python:

    try:
        print words[3]
    except IndexError:                     # there aren't enough words
        pass
Extracting the fourth column of a file, where columns are separated by colons, and lowercasing it

import sys, string
for line in sys.stdin.readlines():
    words = string.split(line, ':') 
    if len(words) >= 4:
        print string.lower(words[3])
Printing the first 10 lines, the last 10 lines, and every other line

import sys, string
lines = sys.stdin.readlines()
sys.stdout.writelines(lines[:10])          # first ten lines
sys.stdout.writelines(lines[-10:])         # last ten lines
for lineIndex in range(0, len(lines), 2):  # get 0, 2, 4, ...
    sys.stdout.write(lines[lineIndex])     # get the indexed line
Counting the number of times the word "Python" occurs in a file

import string
text = open(fname).read()                  # fname is the name of the file
print string.count(text, 'Python')
Changing a list of columns into a list of rows
In this more complicated example, the task is to "transpose" a file; imagine you have a file that looks like:

Name:   Willie   Mark   Guido   Mary  Rachel   Ahmed
Level:    5       4      3       1     6        4
Tag#:    1234   4451   5515    5124   1881    5132
And you really want it to look like the following instead:

Name:  Level:  Tag#:
Willie 5       1234
Mark   4       4451
...
You could use code like the following:

import sys, string
lines = sys.stdin.readlines()
wordlists = []
for line in lines:
    words = string.split(line)
    wordlists.append(words)
for row in range(len(wordlists[0])):
    for col in range(len(wordlists)):
        print wordlists[col][row] + '\t',
    print
Of course, you should really use much more defensive programming techniques to deal with the possibility that not all lines have the same number of words in them, that there may be missing data, etc. Those techniques are task-specific and are left as an exercise to the reader.

Choosing chunk sizes

All the preceding examples assume you can read the entire file at once (that's what the readlines call expects). In some cases, however, that's not possible, for example when processing really huge files on computers with little memory, or when dealing with files that are constantly being appended to (such as log files). In such cases, you can use a while/readline combination, where some of the file is read a bit at a time, until the end of file is reached. In dealing with files that aren't line-oriented, you must read the file a character at a time:


# read character by character
import sys

while 1:
    next = sys.stdin.read(1)            # read a one-character string
    if not next:                        # or an empty string at EOF
        break
    process(next)                       # process the character 'next'

Notice that the read() method on file objects returns an empty string at end of file, which breaks out of the while loop. Most often, however, the files you'll deal with consist of line-based data and are processed a line at a time:


# read line by line
import sys

while 1:
    next = sys.stdin.readline()         # read a one-line string
    if not next:                        # or an empty string at EOF
        break
    process(next)                       # process the line 'next'

Doing Something to a Set of Files Specified on the Command Line

Being able to read stdin is a great feature; it's the foundation of the Unix toolset. However, one input is not always enough: many tasks need to be performed on sets of files. This is usually done by having the Python program parse the list of arguments sent to the script as command-line options. For example, if you type:


% python myScript.py input1.txt input2.txt input3.txt output.txt

you might think that myScript.py wants to do something with the first three input files and write a new file, called output.txt. Let's see what the beginning of such a program could look like:


import sys
inputfilenames, outputfilename = sys.argv[1:-1], sys.argv[-1]

for inputfilename in inputfilenames:
    inputfile = open(inputfilename, "r")
    do_something_with_input(inputfile)
outputfile = open(outputfilename, "w")
write_results(outputfile)

The second line extracts parts of the argv attribute of the sys module. Recall that it's a list of the words on the command line that called the current program. It starts with the name of the script. So, in the example above, the value of sys.argv is:


['myScript.py', 'input1.txt', 'input2.txt', 'input3.txt', 'output.txt']

The script assumes that the command line consists of one or more input files and one output file. So the slicing of the input file names starts at 1 (to skip the name of the script, which isn't an input to the script in most cases), and stops before the last word on the command line, which is the name of the output file. The rest of the script should be pretty easy to understand (but won't work until you provide the do_something_with_input() and write_results() functions).
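The slicing described here can be checked in isolation with a literal list standing in for sys.argv (a sketch, not part of the original script):

```python
# simulate sys.argv for the command line shown above
argv = ['myScript.py', 'input1.txt', 'input2.txt', 'input3.txt', 'output.txt']

# skip the script name; everything up to the last word is input,
# the last word is the output filename
inputfilenames, outputfilename = argv[1:-1], argv[-1]
assert inputfilenames == ['input1.txt', 'input2.txt', 'input3.txt']
assert outputfilename == 'output.txt'
```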

Note that the preceding script doesn't actually read in the data from the files, but passes the file object down to a function to do the real work. Such a function often uses the readlines() method on file objects, which returns a list of the lines in that file. A generic version of do_something_with_input() is:


def do_something_with_input(inputfile):
    for line in inputfile.readlines():
        process(line)

Processing Each Line of One or More Files:
The fileinput Module

The combination of this idiom with the preceding one regarding opening each file in the sys.argv[1:] list is so common that Python 1.5 introduced a new module that's designed to help do just this task. It's called fileinput and works like this:


import fileinput
for line in fileinput.input():
    process(line)

The fileinput.input() call parses the arguments on the command line, and if there are no arguments to the script, uses sys.stdin instead. It also provides a bunch of useful functions that let you know which file and line number you're currently manipulating:


import fileinput, sys, string
# take the first argument out of sys.argv and assign it to searchterm
searchterm, sys.argv[1:] = sys.argv[1], sys.argv[2:]
for line in fileinput.input():
   num_matches = string.count(line, searchterm)
   if num_matches:                     # a nonzero count means there was a match
       print "found '%s' %d times in %s on line %d." % (searchterm, num_matches, 
           fileinput.filename(), fileinput.filelineno())

If this script were called mygrep.py, it could be used as follows:


% python mygrep.py in *.py
found 'in' 2 times in countlines.py on line 2.
found 'in' 2 times in countlines.py on line 3.
found 'in' 2 times in mygrep.py on line 1.
found 'in' 4 times in mygrep.py on line 4.
found 'in' 2 times in mygrep.py on line 5.
found 'in' 2 times in mygrep.py on line 7.
found 'in' 3 times in mygrep.py on line 8.
found 'in' 3 times in mygrep.py on line 12.

Filenames and Directories

We have now covered reading existing files, and if you remember the discussion on the open built-in function in Chapter 2, you know how to create new files. There are a lot of tasks, however, that need different kinds of file manipulations, such as directory and path management and removing files. Your two best friends in such cases are the os and os.path modules described in Chapter 8, Built-in Tools.

Let's take a typical example: you have lots of files, all of which have a space in their name, and you'd like to replace the spaces with underscores. All you really need is the os.curdir attribute (which returns an operating-system specific string that corresponds to the current directory), the os.listdir function (which returns the list of filenames in a specified directory), and the os.rename function:


import sys, os, string
if len(sys.argv) == 1:                     # if no filenames are specified,
    filenames = os.listdir(os.curdir)      #   use current dir
else:                                      # otherwise, use files specified
    filenames = sys.argv[1:]               #   on the command line
for filename in filenames:
    if ' ' in filename:
        newfilename = string.replace(filename, ' ', '_')
        print "Renaming", filename, "to", newfilename, "..."
        os.rename(filename, newfilename)

This program works fine, but it reveals a certain Unix-centrism. That is, if you call it with wildcards, such as:


python despacify.py *.txt

you find that on Unix machines, it renames all the files with names with spaces in them and that end with .txt. In a DOS-style shell, however, this won't work because the shell normally used in DOS and Windows doesn't convert from *.txt to the list of filenames; it expects the program to do it. This is called globbing, because the * is said to match a glob of characters.

Matching Sets of Files: The glob Module

The glob module exports a single function, also called glob, which takes a filename pattern and returns a list of all the filenames that match that pattern (in the current working directory):


import sys, glob, operator
print sys.argv[1:]
sys.argv = reduce(operator.add, map(glob.glob, sys.argv))
print sys.argv[1:]

Running this on Unix and DOS shows that on Unix, the Python glob didn't do anything because the globbing was done by the Unix shell before Python was invoked, and on DOS, Python's globbing came up with the same answer:


/usr/python/book$ python showglob.py *.py
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']
 
C:\python\book> python showglob.py *.py
['*.py']
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']

This script isn't trivial, though, because it uses two conceptually difficult operations: a map followed by a reduce. map was mentioned in Chapter 4, Functions, but reduce is new to you at this point (unless you have background in LISP-type languages). map is a function that takes a callable object (usually a function) and a sequence, calls the callable object with each element of the sequence in turn, and returns a list containing the values returned by the function. For a graphical representation of what map does, see Figure 9-1. [3]

Figure 9-1. Graphical representation of the behavior of the map built-in

 

map is needed here (or something equivalent) because you don't know how many arguments were entered on the command line (e.g., it could have been *.py *.txt *.doc). So the glob.glob function is called with each argument in turn. Each glob.glob call returns a list of filenames that match the pattern. The map operation then returns a list of lists, which you need to convert to a single list--the combination of all the lists in this list of lists. That means doing list1 + list2 + ... + listN. That's exactly the kind of situation where the reduce function comes in handy.

Just as with map, reduce takes a function as its first argument and applies it to the first two elements of the sequence it receives as its second argument. It then takes the result of that call and calls the function again with that result and the next element in the sequence, etc. (See Figure 9-2 for an illustration of reduce.) But wait: you need + applied to a set of things, and + doesn't look like a function (it isn't). So a function is needed that works the same as +. Here's one:


def myAdd(something, other):
    return something + other

You would then use reduce(myAdd, map(...)). This works fine, but better yet, you can use the add function defined in the operator module, which does the same thing. The operator module defines functions for every syntactic operation in Python (including attribute-getting and slicing), and you should use those instead of homemade ones for two reasons. First, they've been coded, debugged, and tested by Guido, who has a pretty good track record at writing bug-free code. Second, they're actually C functions, and applying reduce (or map, or filter) to C functions results in much faster performance than applying it to Python functions. This clearly doesn't matter when all you're doing is going through a few hundred files once. If you do thousands of globs all the time, however, speed can become an issue, and now you know how to do it quickly.
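Here's a self-contained sketch of this flattening technique, with a made-up list of lists standing in for the output of map(glob.glob, ...) (the try/except is there only so the snippet also runs on newer Pythons, where reduce lives in the functools module):

```python
import operator
try:
    from functools import reduce   # needed on newer Pythons; builtin otherwise
except ImportError:
    pass

# stands in for the list of lists returned by map(glob.glob, sys.argv)
listOfLists = [['a.py'], ['b.py', 'c.py'], []]

# reduce with operator.add computes list1 + list2 + ... + listN
flat = reduce(operator.add, listOfLists, [])
assert flat == ['a.py', 'b.py', 'c.py']
```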

Figure 9-2. Graphical representation of the behavior of the reduce built-in

 

The filter built-in function, like map and reduce, takes a function and a sequence as arguments. It returns the subset of the elements in the sequence for which the specified function returns something that's true. To find all of the even numbers in a set, type this:


>>> numbers = range(30)
>>> def even(x):
...     return x % 2 == 0
...
>>> print numbers
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
21, 22, 23, 24, 25, 26, 27, 28, 29]
>>> print filter(even, numbers)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

Or, if you wanted to find all the words in a file that are at least 10 characters long, you could use:


import string
words = string.split(open('myfile.txt').read())         # get all the words
 
def at_least_ten(word): 
    return len(word) >= 10
 
longwords = filter(at_least_ten, words)

For a graphical representation of what filter does, see Figure 9-3. One nice special feature of filter is that if one passes None as the first argument, it filters out all false entries in the sequence. So, to find all the nonempty lines in a file called myfile.txt, do this:


lines = open('myfile.txt').readlines()
lines = filter(None, lines)             # remember, the empty string is false

map, filter, and reduce are three powerful constructs, and they're worth knowing about; however, they are never necessary. It's fairly simple to write a Python function that does the same thing as any of them. The built-in versions are just as fast, especially when operating on built-in functions written in C, such as the functions in the operator module.
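To back up the claim that these built-ins are never strictly necessary, here's a sketch of hand-rolled equivalents of map and filter (the names mymap, myfilter, double, and even are made up for this example):

```python
def mymap(function, sequence):
    # build the list of results of calling function on each element
    result = []
    for item in sequence:
        result.append(function(item))
    return result

def myfilter(function, sequence):
    # keep only the elements for which function returns something true
    result = []
    for item in sequence:
        if function(item):
            result.append(item)
    return result

def double(x):
    return 2 * x

def even(x):
    return x % 2 == 0

assert mymap(double, [1, 2, 3]) == [2, 4, 6]
assert myfilter(even, range(6)) == [0, 2, 4]
```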

Figure 9-3. Graphical representation of the behavior of the filter built-in

 

Using Temporary Files

If you've ever written a shell script and needed to use intermediary files for storing the results of some intermediate stages of processing, you probably suffered from directory litter. You started out with 20 files called log_001.txt, log_002.txt etc., and all you wanted was one summary file called log_sum.txt. In addition, you had a whole bunch of log_001.tmp, log_001.tm2, etc. files that, while they were labeled temporary, stuck around. At least that's what we've seen happen in our own lives. To put order back into your directories, use temporary files in specific directories and clean them up afterwards.

To help in this temporary file-management problem, Python provides a nice little module called tempfile that publishes two functions: mktemp() and TemporaryFile(). The former returns the name of a file not currently in use in a directory on your computer reserved for temporary files (such as /tmp on Unix or C:\TMP on Windows). The latter returns a new file object directly. For example:


# read input file
inputFile = open('input.txt', 'r')
 
import tempfile
# create temporary file
tempFile = tempfile.TemporaryFile()                   # we don't even need to 
first_process(input = inputFile, output = tempFile)   # know the filename...
 
# create final output file
outputFile = open('output.txt', 'w')
second_process(input = tempFile, output = outputFile)

Using tempfile.TemporaryFile() works well in cases where the intermediate steps manipulate file objects. One of its nice features is that when it's deleted, it automatically deletes the file it created on disk, thus cleaning up after itself. One important use of temporary files, however, is in conjunction with the os.system call, which means using a shell, hence using filenames, not file objects. For example, let's look at a program that creates form letters and mails them to a list of email addresses (on Unix only):


import os, tempfile

formletter = """Dear %s,\nI'm writing to you to suggest that ..."""    # etc. 
myDatabase = [('Bill Clinton', 'bill@whitehouse.gov.us'),
              ('Bill Gates', 'bill@microsoft.com'),
              ('Bob', 'bob@subgenius.org')]
for name, email in myDatabase:
    specificLetter = formletter % name
    tempfilename = tempfile.mktemp()
    tempfilehandle = open(tempfilename, 'w')    # don't call this tempfile--
    tempfilehandle.write(specificLetter)        # that would shadow the module
    tempfilehandle.close()
    os.system('/usr/bin/mail %(email)s -s "Urgent!" < %(tempfilename)s' % vars()) 
    os.remove(tempfilename)

The first line in the for loop returns a customized version of the form letter based on the name it's given. That text is then written to a temporary file that's emailed to the appropriate email address using the os.system call (which we'll cover later in this chapter). Finally, to clean up, the temporary file is removed. If you forgot how the % bit works, go back to Chapter 2 and review it; it's worth knowing. The vars() function is a built-in function that returns a dictionary corresponding to the variables defined in the current local namespace. The keys of the dictionary are the variable names, and the values of the dictionary are the variable values. vars() comes in quite handy for exploring namespaces. It can also be called with an object as an argument (such as a module, a class, or an instance), and it will return the namespace of that object. Two other built-ins, locals() and globals(), return the local and global namespaces, respectively. In all three cases, modifying the returned dictionaries doesn't guarantee any effect on the namespace in question, so view these as read-only and you won't be surprised. You can see that the vars() call creates a dictionary that is used by the string interpolation mechanism; it's thus important that the names inside the %(...)s bits in the string match the variable names in the program.
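Here's the vars() interpolation trick in isolation (the names are made up for the example):

```python
def describe():
    city = 'Paris'
    temp = 39
    # vars() with no argument returns the local namespace as a dictionary,
    # so the %(...)s targets must match the local variable names exactly
    return "It is %(temp)d F in %(city)s." % vars()
```

Calling describe() produces "It is 39 F in Paris."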

More on Scanning Text Files

Suppose you've run a program that stores its output in a text file, which you need to load. The program creates a file that's composed of a series of lines that each contain a value and a key separated by whitespace:


value key
value key
value key
and so on...

A key can appear on more than one line in the file, and you'd probably like to collect all the values that appear for each given key as you scan the file. Here's one way to solve this problem:


#!/usr/bin/env python
import sys, string
 
entries = {}
for line in open(sys.argv[1], 'r').readlines():
    left, right = string.split(line)    
    try:                                
        entries[right].append(left)       # extend list
    except KeyError:
        entries[right] = [left]           # first time seen
 
for (right, lefts) in entries.items():
    print "%04d '%s'\titems => %s" % (len(lefts), right, lefts)

This script uses the readlines method to scan the text file line by line, and calls the built-in string.split function to chop the line into a list of substrings--a list containing the value and key strings separated by blanks or tabs in the file. To store all occurrences of a key, the script uses a dictionary called entries. The try statement in the loop tries to add new values to an existing entry for a key; if no entry exists for the key, it creates one. Notice that the try could be replaced with an if here:


if entries.has_key(right):        # is it already in the dictionary?
    entries[right].append(left)   # add to the list of current values for key
else:
    entries[right] = [left]       # initialize key's values list

Testing whether a dictionary contains a key is sometimes faster than catching an exception with the try technique; it depends on how many times the test is true. Here's an example of this script in action. The input filename is passed in as a command-line argument (sys.argv[1]):


% cat data.txt
1       one
2       one
3       two
7       three
8       two
10      one
14      three
19      three
20      three
30      three
 
% python collector1.py data.txt
0003 'one'      items => ['1', '2', '10']
0005 'three'    items => ['7', '14', '19', '20', '30']
0002 'two'      items => ['3', '8']

You can make this code more useful by packaging the scanner logic in a function that returns the entries dictionary as a result and wrapping the printing loop logic at the bottom in an if test:


#!/usr/bin/env python
import sys, string
 
def collect(file):
    entries = {}
    for line in file.readlines():
        left, right = string.split(line)    
        try:                                
            entries[right].append(left)           # extend list
        except KeyError:
            entries[right] = [left]               # first time seen
    return entries
 
if __name__ == "__main__":                    # when run as a script
    if len(sys.argv) == 1:
        result = collect(sys.stdin)               # read from stdin stream
    else:
        result = collect(open(sys.argv[1], 'r'))  # read from passed filename
    for (right, lefts) in result.items():
        print "%04d '%s'\titems => %s" % (len(lefts), right, lefts)

This way, the program becomes a bit more flexible. By using the if __name__ == "__main__" trick, you can still run it as a top-level script (and get a display of the results), or import the function it defines and process the resulting dictionary explicitly:


# run as a script file
% collector2.py < data.txt
result displayed here...
 
# use in some other component (or interactively)
from collector2 import collect
result = collect(open("spam.txt", "r"))
process result here...

Since the collect function accepts an open file object, it also works on any object that provides the methods (i.e., interface) built-in files do. For example, if you want to read text from a simple string, wrap it in a class that implements the required interface and pass an instance of the class to the collect function:


>>> from collector2 import collect
>>> from StringIO import StringIO
>>> 
>>> str = StringIO("1 one\n2 one\n3 two")
>>> result = collect(str)                   # scans the wrapped string
>>> print result                             # {'one':['1','2'],'two':['3']}

This code uses the StringIO class in the standard Python library to wrap the string into an instance that has all the methods file objects have; see the Library Reference for more details on StringIO. You could also write a different class or subclass from StringIO if you need to modify its behavior. Regardless, the collect function happily reads text from the string str, which happens to be an in-memory object, not a file.

The main reason all this works is that the collect function was designed to avoid making assumptions about the type of object its file parameter references. As long as the object exports a readlines method that returns a list of strings, collect doesn't care what type of object it processes. The interface is all that matters. This runtime binding[4] is an important feature of Python's object system, and allows you to easily write component programs that communicate with other components. For instance, consider a program that reads and writes satellite telemetry data using the standard file interface. By plugging in an object with the right sort of interface, you can redirect its streams to live sockets, GUI boxes, web interfaces, or databases without changing the program itself or even recompiling it.
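The same idea in miniature: this hypothetical count_lines function works with any object that supplies a readlines method, file or not:

```python
def count_lines(file):
    # depends only on the readlines interface, not on a real file object
    return len(file.readlines())

class FakeFile:
    def __init__(self, text):
        self.text = text
    def readlines(self):
        return self.text.split('\n')    # pretend each piece is a file line
```

count_lines works identically on open('data.txt') and on FakeFile("a\nb\nc"); the interface is all that matters.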

Manipulating Programs

Calling Other Programs

Python can be used like a shell scripting language, to steer other tools by calling them with arguments the Python program determines at runtime. So, if you have to run a specific program (call it analyzeData) with various data files and various parameters specified on the command line, you can use the os.system() call, which takes a string specifying a command to run in a subshell. Specifically:


import os

for datafname in ['data.001', 'data.002', 'data.003']:
    for parameter1 in range(1, 10):
        os.system("analyzeData -in %(datafname)s -param1 %(parameter1)d" % vars()) 

If analyzeData is a Python program, you're better off doing it without invoking a subshell; simply use the import statement up front and a function call in the loop. Not every useful program out there is a Python program, though.

In the preceding example, the output of analyzeData is most likely either a file or standard out. If it's standard out, it would be nice to be able to capture its output. The popen() function call is an almost standard way to do this. We'll show it off in a real-world task.

When we were writing this book, we were asked to avoid using tabs in source-code listings and use spaces instead. Tabs can wreak havoc with typesetting, and since indentation matters in Python, incorrect typesetting has the potential to break examples. But since old habits die hard (at least one of us uses tabs to indent his own Python code), we wanted a tool to find any tabs that may have crept into our code before it was shipped off for publication. The following script, findtabs.py, does the trick:


#!/usr/bin/env python
# find files, search for tabs
 
import string, os
cmd = 'find . -name "*.py" -print'         # find is a standard Unix tool
 
for file in os.popen(cmd).readlines():     # run find command
    num  = 1
    name = file[:-1]                       # strip '\n'
    for line in open(name).readlines():    # scan the file
        pos = string.find(line, "\t")
        if  pos >= 0:
            print name, num, pos           # report tab found
            print '....', line[:-1]        # [:-1] strips final \n
            print '....', ' '*pos + '*', '\n'
        num = num+1

This script uses two nested for loops. The outer loop uses os.popen to run a find shell command, which returns a list of all the Python source filenames accessible in the current directory and its subdirectories. The inner loop reads each line in the current file, using string.find to look for tabs. But the real magic in this script is in the built-in tools it employs:

os.popen
Takes a shell command passed in as a string (called cmd in the example) and returns a file-like object connected to the command's standard input or output streams. Output is the default if you don't pass an explicit "r" or "w" mode argument. By reading the file-like object, you can intercept the command's output as we did here--the result of the find. It turns out that there's a module in the standard library called find.py that provides a function that does a very similar thing to our use of popen with the find Unix command. As an exercise, you could rewrite findtabs.py to use it instead.
string.find
Returns the index of the first occurrence of one string in another, searching from left to right. In the script, we use it to look for a tab, passed in as an (escaped) one-character string ('\t').
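You can try the pointer trick by itself; this sketch uses the find string method available in newer Python releases in place of string.find:

```python
line = "print\t'bad style'"
pos = line.find("\t")        # index of the first tab (-1 if there is none)
marker = ' ' * pos + '*'     # string repetition slides the * under the tab
```

Here pos is 5, so marker is five spaces followed by a star.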

When a tab is found, the script prints the matching line, along with a pointer to where the tab occurs. Notice the use of string repetition: the expression ' '*pos moves the print cursor to the right, up to the index of the first tab. Notice also that cmd embeds double quotes inside a single-quoted string, so no backslash escapes are needed. Here is the script at work, catching illegal tabs in the unfortunately named file happyfingers.py :


C:\python\book-examples> python findtabs.py
./happyfingers.py 2 0
....   for i in range(10):
.... *
 
./happyfingers.py 3 0
....           print "oops..."
.... *
 
./happyfingers.py 5 5
.... print      "bad style"
....      *

A note on portability: the find shell command used in the findtabs script is a Unix command, which may or may not be available on other platforms (it ran under Windows in the listing above because a find utility program was installed). os.popen functionality is available as win32pipe.popen in the win32 extensions to Python for Windows.[5] If you want to write code that catches shell command output portably, use something like the following code early in your script:


import sys
if sys.platform == "win32":                # on a Windows port
    try:
        import win32pipe
        popen = win32pipe.popen
    except ImportError:
        raise ImportError, "The win32pipe module could not be found"
else:                                      # else on POSIX box
    import os
    popen = os.popen
# ...and use popen in blissful platform ignorance

The sys.platform attribute is always preset to a string that identifies the underlying platform (and hence the Python port you're using). Although the Python language isn't platform-dependent, some of its libraries may be; checking sys.platform is the standard way to handle cases where they are. Notice the nested import statements here; as we've seen, import is just an executable statement that assigns a variable name.

Internet-Related Activities

The Internet is a treasure trove of information, but its exponential growth can make it hard to manage. Furthermore, most tools currently available for "surfing the Web" are not programmable. Many web-related tasks can be automated quite simply with the tools in the standard Python distribution.

Downloading a Web Page Programmatically

If you're interested in finding out what the weather in a given location is over a period of months, it's much easier to set up an automated program to get the information and collect it in a file than to have to remember to do it by hand.

Here is a program that finds the weather in a couple of cities and states using the pages of the weather.com web site:


import urllib, urlparse, string, time
 
def get_temperature(country, state, city):
    url = urlparse.urljoin('http://www.weather.com/weather/cities/',
                           string.lower(country)+'_' + \
                           string.lower(state) + '_' + \
                           string.replace(string.lower(city), ' ',
                                          '_') + '.html')
    data = urllib.urlopen(url).read()
    start = string.index(data, 'current temp: ') + len('current temp: ')
    stop = string.index(data, '&deg;F', start-1)
    temp = int(data[start:stop])
    localtime = time.asctime(time.localtime(time.time()))
    print ("On %(localtime)s, the temperature in %(city)s, " +\
           "%(state)s %(country)s is %(temp)s F.") % vars()
 
get_temperature('FR', '', 'Paris')
get_temperature('US', 'RI', 'Providence')
get_temperature('US', 'CA', 'San Francisco')

When run, it produces output like:


~/book:> python get_temperature.py
On Wed Nov 25 16:22:25 1998, the temperature in Paris,  FR is 39 F.
On Wed Nov 25 16:22:30 1998, the temperature in Providence, RI US is 39 F.
On Wed Nov 25 16:22:35 1998, the temperature in San Francisco, CA US is 58 F.

The code in get_temperature.py suffers from one flaw, which is that the logic of the URL creation and of the temperature extraction is dependent on the specific HTML produced by the web site you use. The day the site's graphic designer decides that "current temp:" should be spelled with capitalized words, this script won't work. This is a problem with programmatic parsing of web pages that will go away only when more structural formats (such as XML) are used to produce web pages.[6]
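The fragile marker-matching logic can at least be factored into a helper like the following sketch (the sample string is invented, mimicking the page's markup):

```python
def extract_between(data, before, after):
    # return the text between two marker strings;
    # raises ValueError if either marker is missing
    start = data.index(before) + len(before)
    stop = data.index(after, start)
    return data[start:stop]

sample = "<b>current temp: 58&deg;F</b>"    # hypothetical page fragment
temp = int(extract_between(sample, 'current temp: ', '&deg;F'))
```

The day the markers change, extract_between raises ValueError instead of silently returning garbage, which at least makes the breakage obvious.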

Checking the Validity of Links and Mirroring
Web Sites: webchecker.py and Friends

One of the big hassles of maintaining a web site is that as the number of links in the site increases, so does the chance that some of the links will no longer be valid. Good web-site maintenance therefore includes periodic checking for such stale links. The standard Python distribution includes a tool that does just this. It lives in the Tools/webchecker directory and is called webchecker.py.

A companion program called websucker.py located in the same directory uses similar logic to create a local copy of a remote web site. Be careful when trying it out, because if you're not careful, it will try to download the entire Web on your machine! The same directory includes two programs called wsgui.py and webgui.py that are Tkinter-based frontends to websucker and webchecker, respectively. We encourage you to look at the source code for these programs to see how one can build sophisticated web-management systems with Python's standard toolset.

In the Tools/Scripts directory, you'll find many other small to medium-sized scripts that might be of interest, such as an equivalent of websucker.py for FTP servers called ftpmirror.py.

Checking Mail

Electronic mail is probably the most important medium on the Internet today; it's certainly the protocol with which most information passes between individuals. Python includes several libraries for processing mail. The one you'll need to use depends on the kind of mail server you're using. Modules for interacting with POP3 servers (poplib) and IMAP servers (imaplib) are included. If you need to talk to a Microsoft Exchange server, you'll need some of the tools in the win32 distribution (see Appendix B, Platform-Specific Topics, for pointers to the win32 extensions web page).

Here's a simple test of the poplib module, which is used to talk to a mail server running the POP protocol:


>>> from poplib import *
>>> server = POP3('mailserver.spam.org')
>>> print server.getwelcome()
+OK QUALCOMM Pop server derived from UCB (version 2.1.4-R3) at spam starting.
>>> server.user('da')
'+OK Password required for da.'
>>> server.pass_('youllneverguess')
'+OK da has 153 message(s) (458167 octets).'
>>> header, msg, octets = server.retr(152) # let's get the latest msgs
>>> import string
>>> print string.join(msg[:3], '\n')   # and look at the first three lines
Return-Path: <jim@bigbad.com>
Received: from gator.bigbad.com by mailserver.spam.org (4.1/SMI-4.1)
        id AA29605; Wed, 25 Nov 98 15:59:24 PST

In a real application, you'd use a specialized module such as rfc822 to parse the header lines, and perhaps the mimetools and mimify modules to get the data out of the message body (e.g., to process attached files).
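In later Python releases those modules were folded into the email package; a minimal header-and-body parse looks like this sketch (the message text is invented):

```python
import email

raw = "From: jim@bigbad.com\nSubject: Urgent!\n\nHello there."
msg = email.message_from_string(raw)    # parse headers and body
sender = msg['From']                    # headers act like a dictionary
body = msg.get_payload()                # everything after the blank line
```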

Bigger Examples

Compounding Your Interest

Someday, most of us hope to put a little money away in a savings account (assuming those student loans ever go away). Banks hope you do too, so much so that they'll pay you for the privilege of holding onto your money. In a typical savings account, your bank pays you interest on your principal. Moreover, they keep adding the percentage they pay you back to your total, so that your balance grows a little bit each year. The upshot is that you need to project on a year-by-year basis if you want to track the growth in your savings. This program, interest.py, is an easy way to do it in Python:


trace = 1  # print each year?
 
def calc(principal, interest, years):
    for y in range(years):
        principal = principal * (1.00 + (interest / 100.0))
        if trace: print y+1, '=> %.2f' % principal
    return principal

This function just loops through the number of years you pass in, accumulating the principal (your initial deposit plus all the interest added so far) for each year. It assumes that you'll avoid the temptation to withdraw money. Now, suppose we have $65,000 to invest in a 5.5% interest yield account, and want to track how the principal will grow over 10 years. We import and call our compounding function passing in a starting principal, an interest rate, and the number of years we want to project:


% python
>>> from interest import calc
>>> calc(65000, 5.5, 10)
1 => 68575.00
2 => 72346.63
3 => 76325.69
4 => 80523.60
5 => 84952.40
6 => 89624.78
7 => 94554.15
8 => 99754.62
9 => 105241.13
10 => 111029.39
111029.389793

and we wind up with $111,029. If we just want to see the final balance, we can set the trace global (module-level) variable in interest to 0 before we call the calc function:


>>> import interest
>>> interest.trace = 0
>>> calc(65000, 5.5, 10)
111029.389793
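Since the loop multiplies by the same factor every year, you can double-check it against the closed-form compound-interest formula, principal * (1 + rate)**years:

```python
def calc_closed_form(principal, interest, years):
    # the whole computation in one expression, no loop needed
    return principal * (1 + interest / 100.0) ** years
```

calc_closed_form(65000, 5.5, 10) gives the same 111029.39, give or take floating-point noise in the last digits.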

Naturally, there are many ways to calculate compound interest. For example, the variation of the interest calculator function below adds to the principal explicitly, and prints both the interest earned (earnings) and current balance (principal) as it steps through the years:


def calc(principal, interest, years):
    interest = interest / 100.0
    for y in range(years):
        earnings  = principal * interest
        principal = principal + earnings
        if trace: print y+1, '(+%d)' % earnings, '=> %.2f' % principal
    return principal

We get the same results with this version, but more information:


>>> interest.trace = 1
>>> calc(65000, 5.5, 10)
1 (+3575) => 68575.00
2 (+3771) => 72346.63
3 (+3979) => 76325.69
4 (+4197) => 80523.60
5 (+4428) => 84952.40
6 (+4672) => 89624.78
7 (+4929) => 94554.15
8 (+5200) => 99754.62
9 (+5486) => 105241.13
10 (+5788) => 111029.39
111029.389793

The last comment on this script is that it may not give you exactly the same numbers as your bank. Bank programs tend to round everything off to the cent on a regular basis. Our program rounds off the numbers to the cent when printing the results (that's what the %.2f does; see Chapter 2 for details), but keeps the full precision afforded by the computer in its intermediate computation (as shown in the last line).
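If you want to mimic the bank, round the balance to the cent inside the loop; over ten years the result differs from ours by only a few cents:

```python
def calc_rounded(principal, interest, years):
    interest = interest / 100.0
    for y in range(years):
        # round to the nearest cent after adding each year's earnings
        principal = round(principal + principal * interest, 2)
    return principal
```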

An Automated Dial-Out Script

Once upon a time, a certain book's coauthor worked at a company without an Internet feed. The system support staff did, however, install a dial-out modem on site, so anyone with a personal Internet account and a little Unix savvy could connect to a shell account and do all their Internet business at work. Dialing out meant using the Kermit file transfer utility.

One drawback with the modem setup was that people wanting to dial out had to keep trying each of 10 possible modems until one was free (dial on one; if it's busy, try another, and so on). Since modems were addressable under Unix using the filename pattern /dev/modem*, and modem locks via /var/spool/locks/LCK*modem*, a simple Python script was enough to check for free modems automatically. The following program, dokermit, uses a list of integers to keep track of which modems are locked, glob.glob to do filename expansion, and os.system to run a kermit command when a free modem has been found:


#!/usr/bin/env python
# find a free modem to dial out on
 
import glob, os, string
LOCKS = "/var/spool/locks/"
 
locked = [0] * 10
for lockname in glob.glob(LOCKS + "LCK*modem*"):    # find locked modems
    print "Found lock:", lockname
    locked[string.atoi(lockname[-1])] = 1           # 0..9 at end of name
 
print 'free: ',
for i in range(10):                                 # report free modems
    if not locked[i]: print i,
print
 
for i in range(10):
    if not locked[i]:
        if raw_input("Try %d? " % i) == 'y':
            os.system("kermit -m hayes -l /dev/modem%d -b 19200 -S" % i)
            if raw_input("More? ") != 'y': break

By convention, modem lock files have the modem number at the end of their names; we use this hook to build a modem device name in the Kermit command. Notice that this script keeps a list of 10 integer flags to mark which modems are free (1 means locked). The program above works only if there are 10 or fewer modems; if there are more, you'd need to use larger lists and loops, and parse the lock filename, not just look at its last character.
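Parsing the number out of a lock filename is a one-line job once you slice off everything up to the known modem marker (the LCK..modem naming scheme follows the convention described above):

```python
def modem_number(lockname):
    # '/var/spool/locks/LCK..modem12' -> 12, however many digits there are
    return int(lockname[lockname.rindex('modem') + len('modem'):])
```

With a helper like this, the script would work unchanged for any number of modems.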

An Interactive Rolodex

While most of the preceding examples use lists as the primary data structures, dictionaries are in many ways more powerful and fun to use. Their presence as a built-in data type is part of what makes Python high level, which basically means "easy to use for complex tasks." Complementing this rich set of built-in data types is an extensive standard library. One powerful module in this library is the cmd module, which provides a class Cmd you can subclass to make a simple command-line interpreter. The following example is fairly large, but it's really not that complicated, and illustrates well the power of dictionaries and of reuse of standard modules.

The task at hand is to keep track of names and phone numbers and allow the user to manipulate this list using an interactive interface, with error checking and user-friendly features such as online help. The following example shows the kind of interaction our program allows:


% python rolo.py
Monty's Friends: help                       
 
Documented commands (type help <topic>):
========================================
EOF             add             find            list            load
save
 
Undocumented commands:
======================
help            

We can get help on specific commands:


Monty's Friends: help find              # compare with the help_find() method
Find an entry (specify a name)

We can manipulate the entries of the Rolodex easily enough:


Monty's Friends: add larry                  # we can add entries
Enter Phone Number for larry: 555-1216
Monty's Friends: add                        # if the name is not specified...
Enter Name: tom                             # ...the program will ask for it
Enter Phone Number for tom: 555-1000
Monty's Friends: list
=========================================
               larry : 555-1216
                 tom : 555-1000
=========================================
Monty's Friends: find larry
The number for larry is 555-1216.
Monty's Friends: save myNames             # save our work
Monty's Friends: ^D                       # quit the program  (^Z on Windows)

And the nice thing is, when we restart this program, we can recover the saved data:


% python rolo.py                       # restart
Monty's Friends: list                  # by default, there is no one listed
Monty's Friends: load myNames          # it only takes this to reload the dir
Monty's Friends: list
=========================================
               larry : 555-1216
                 tom : 555-1000
=========================================

Most of the interactive interpreter functionality is provided by the Cmd class in the cmd module, which just needs customization to work. Specifically, you need to set the prompt attribute and add some methods that start with do_ and help_. The do_ methods must take a single argument, and the part after the do_ is the name of the command. Once you call the cmdloop() method, the Cmd class does the rest. Read the following code, rolo.py, one method at a time and compare the methods with the previous output:


#!/usr/bin/env python 
# An interactive rolodex
 
import string, sys, pickle, cmd
 
class Rolodex(cmd.Cmd):
 
    def __init__(self):
        cmd.Cmd.__init__(self)              # initialize the base class
        self.prompt = "Monty's Friends: "   # customize the prompt
        self.people = {}                    # at first, we know nobody
 
    def help_add(self): 
        print "Adds an entry (specify a name)"
    def do_add(self, name):
        if name == "": name = raw_input("Enter Name: ")
        phone = raw_input("Enter Phone Number for "+ name+": ")
        self.people[name] = phone           # add phone number for name
 
    def help_find(self):
        print "Find an entry (specify a name)"
    def do_find(self, name):
        if name == "": name = raw_input("Enter Name: ")
        if self.people.has_key(name):
            print "The number for %s is %s." % (name, self.people[name])
        else:
            print "We have no record for %s." % (name,)
 
    def help_list(self):
        print "Prints the contents of the directory"
    def do_list(self, line):        
        names = self.people.keys()         # the keys are the names
        if names == []: return             # if there are no names, exit
        names.sort()                       # we want them in alphabetic order
        print '='*41
        for name in names:
            print string.rjust(name, 20), ":", string.ljust(self.people[name], 20)
        print '='*41
 
    def help_EOF(self):
        print "Quits the program"
    def do_EOF(self, line):
        sys.exit()
 
    def help_save(self):
        print "save the current state of affairs"
    def do_save(self, filename):
        if filename == "": filename = raw_input("Enter filename: ")
        saveFile = open(filename, 'w')
        pickle.dump(self.people, saveFile)
 
    def help_load(self):
        print "load a directory"
    def do_load(self, filename):
        if filename == "": filename = raw_input("Enter filename: ")
        saveFile = open(filename, 'r')
        self.people = pickle.load(saveFile) # note that this will override
                                            # any existing people directory
 
if __name__ == '__main__':               # this way the module can be
    rolo = Rolodex()                     # imported by other programs as well
    rolo.cmdloop()                             

So, the people instance variable is a simple mapping between names and phone numbers that the add and find commands use. Commands are the methods which start with do_ , and their help is given by the corresponding help_ methods. Finally, the load and save commands use the pickle module, which is explained in more detail in Chapter 10, Frameworks and Applications.
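The pickle calls in save and load are easy to experiment with on their own; dumps and loads do the same job in memory (newer Pythons pickle to byte strings, so real files should be opened in binary mode):

```python
import pickle

people = {'larry': '555-1216', 'tom': '555-1000'}
data = pickle.dumps(people)      # serialize the dictionary to a byte string
restored = pickle.loads(data)    # and reconstruct an equal copy from it
```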

How Does the Cmd Class Work, Anyway?

To understand how the Cmd class works, read the cmd module in the standard Python library you've already installed on your computer.

Most of the work we're interested in happens in the Cmd interpreter's onecmd() method, which is called whenever a line is entered by the user. This method figures out the first word of the line that corresponds to a command (e.g., help, find, save, load, etc.). It then looks to see if the instance of the Cmd subclass has an attribute with the right name (if the command was "find tom", it looks for an attribute called do_find). If it finds this attribute, it calls it with the arguments to the command (in this case 'tom'), and returns the result. Similar magic is performed by the do_help() method, which is invoked by this same mechanism (which is why it's called do_help()!). The code for the onecmd() method once looked like this (the version you have may have had features added):


# onecmd method of Cmd class, see Lib/cmd.py
def onecmd(self, line):         # line is something like "find tom"
    line = string.strip(line)   # get rid of extra whitespace
    if not line:                # if there is nothing left, 
        line = self.lastcmd     # redo the last command
    else:
        self.lastcmd = line     # save for next time
    i, n = 0, len(line)
                                # next line finds end of first word
    while i < n and line[i] in self.identchars: i = i+1
                                # split line into command + arguments
    cmd, arg = line[:i], string.strip(line[i:])
    if cmd == '':               # happens if line doesn't start with A-z
        return self.default(line)
    else:                       # cmd is 'find', line is 'tom'
        try:
            func = getattr(self, 'do_' + cmd)  # look for method
        except AttributeError:
            return self.default(line)
        return func(arg)         # call method with the rest of the line

This example demonstrates the power of Python that comes from extending existing modules. The cmd module takes care of the prompt, help facility, and parsing of the input. The pickle module does all the loading and saving that can be so difficult in lesser languages. All we had to write were the parts specific to the task at hand. The generic aspect, namely an interactive interpreter, came free.
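The do_/help_ dispatch described above can be seen in a few lines. This sketch (in modern Python 3 syntax; Greeter, do_hello, and last_greeting are hypothetical names, not part of the Rolodex example) feeds a line straight to onecmd() so the mechanism is visible without running the interactive cmdloop():

```python
import cmd

class Greeter(cmd.Cmd):
    prompt = "greet> "

    def do_hello(self, arg):
        # 'hello tom' is dispatched here via getattr(self, 'do_hello'),
        # with arg bound to the rest of the line ('tom').
        self.last_greeting = "Hello, %s!" % arg
        print(self.last_greeting)

    def do_quit(self, arg):
        # Returning a true value ends cmdloop().
        return True

g = Greeter()
# Feed one line directly to onecmd() instead of starting cmdloop(),
# so the command lookup can be exercised without a terminal.
g.onecmd("hello tom")   # calls g.do_hello("tom")
```

Running cmdloop() on the same instance would give you the prompt, the built-in help command, and line parsing for free, just as in the Rolodex example.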

Exercises

This chapter is full of programs we encourage you to type in and play with. However, if you really want exercises, here are a few more challenging ones:

  1. Redirecting stdout. Modify the mygrep.py script to output to the last file specified on the command line instead of to the console.
  2. Writing a shell. Using the Cmd class in the cmd module and the functions listed in Chapter 8 for manipulating files and directories, write a little shell that accepts the standard Unix commands (or DOS commands if you'd rather): ls (dir) for listing the current directory, cd for changing directory, mv (or ren) for moving/renaming a file, and cp (copy) for copying a file.
  3. Understanding map, reduce, and filter. The map, reduce, and filter functions are somewhat difficult to understand the first time you encounter this type of function, partly because they involve passing functions as arguments, and partly because they do a lot of work for such compactly named functions. One good way to make sure you know how they work is to rewrite them; in this exercise, write three functions (map2, reduce2, filter2) that do the same thing as map, reduce, and filter, respectively, at least as far as we've described how they work:
  4. map2 takes two arguments. The first should be a function accepting two arguments, or None. The second should be a sequence. If the first argument is a function, that function is called with each element of the sequence, and the resulting values are returned in a list. If the first argument is None, the sequence is converted to a list, and that list is returned.
  5. reduce2 takes two arguments. The first must be a function accepting two arguments, and the second must be a sequence. The first two elements of the sequence are used as arguments to the function, the result of that call is sent as the first argument to the function again, with the third element of the sequence as the second argument, and so on, until all elements of the sequence have been used. The last value returned by the function is then the return value of the reduce2 call.
  6. filter2 takes two arguments. The first can be None or a function accepting one argument. The second must be a sequence. If the first argument is None, filter2 returns the subset of the elements in the sequence that test true. If the first argument is a function, it is called with each element of the sequence in turn, and filter2 returns only those elements for which the function's return value is true.
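If you get stuck, here is one possible solution sketch (in modern Python 3 syntax) matching the descriptions above; these are illustrations of the exercise, not the built-ins themselves:

```python
def map2(func, seq):
    if func is None:                # None: just convert to a list
        return list(seq)
    return [func(item) for item in seq]

def reduce2(func, seq):
    # Assumes a non-empty sequence, as the description implies.
    items = list(seq)
    result = items[0]               # start with the first element
    for item in items[1:]:          # fold in the rest, left to right
        result = func(result, item)
    return result

def filter2(func, seq):
    if func is None:                # None: keep elements that test true
        return [item for item in seq if item]
    return [item for item in seq if func(item)]
```

Comparing your versions against the real map, reduce, and filter on a few sample sequences is a good way to check your understanding.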

1. Some objects don't qualify as "reasonably copyable," such as modules, file objects, and sockets. Remember that file objects are different from files on disk.

2. The random module provides many other useful functions, such as the random function, which returns a random floating-point number between 0 and 1. Check a reference source for details.
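For instance (hedged as modern Python 3; the interface of random.random has not changed), the random function returns a float in the half-open interval from 0.0 up to, but not including, 1.0:

```python
import random

# random.random() returns a pseudo-random float in [0.0, 1.0).
x = random.random()
```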

3. It turns out that map can do more; for example, if None is the first argument, map converts the sequence that is its second argument to a list. It can also operate on more than one sequence at a time. Check a reference source for details.
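A small sketch of the multiple-sequence behavior (note that this uses modern Python 3, where map no longer accepts None as its first argument and returns an iterator rather than a list; use list(seq) for the conversion case):

```python
# map over two sequences at once: the function receives one element
# from each sequence per call.
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))
```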

4. Runtime binding means that Python doesn't know which sort of object implements an interface until the program is running. This behavior stems from the lack of type declarations in Python and leads to the notion of polymorphism; in Python, the meaning of an object operation (such as indexing, slicing, etc.) depends on the object being operated on.

5. Two important compatibility comments: the win32pipe module also has a popen2 call, which is like the popen2 call on Unix, except that it returns the read and write pipes in swapped order (see the documentation for popen2 in the posix module for details on its interface). There is no equivalent of popen on Macs, since pipes don't exist on that operating system.

6. XML (eXtensible Markup Language) is a language for marking up structured text files that emphasizes the structure of the document, not its graphical nature. XML processing is an entirely different area of Python text processing, with much ongoing work. See Appendix A, Python Resources, for some pointers to discussion groups and software.



© 2001, O'Reilly & Associates, Inc.