Learning Python
By Mark Lutz & David Ascher
1st Edition April 1999
1-56592-464-9, Order Number: 4649
384 pages, $34.95
Sample Chapter 9: Common Tasks in Python
In this chapter:
Data Structure Manipulations
Manipulating Files
Manipulating Programs
Internet-Related Activities
Bigger Examples
Exercises
At this point, we have covered the syntax of Python, its basic data types, and many of our favorite functions in the Python library. This chapter assumes that all the basic components of the language are at least understood and presents some ways in which Python is, in addition to being elegant and "cool," just plain useful. We present a variety of tasks common to Python programmers. These tasks are grouped by categories--data structure manipulations, file manipulations, etc.
Data Structure Manipulations
One of Python's greatest features is that it provides the list, tuple, and dictionary built-in types. They are so flexible and easy to use that once you've grown used to them, you'll find yourself reaching for them automatically.
Making Copies Inline
Due to Python's reference management scheme, the statement

    a = b

doesn't make a copy of the object referenced by b; instead, it makes a new reference to that object. Sometimes a new copy of an object, not just a shared reference, is needed. How to do this depends on the type of the object in question. The simplest way to make copies of lists and tuples is somewhat odd. If myList is a list, then to make a copy of it, you can do:

    newList = myList[:]

which you can read as "slice from beginning to end," since you'll remember from Chapter 2, Types and Operators, that the default index for the start of a slice is the beginning of the sequence (0), and the default index for the end of a slice is the end of the sequence. Since tuples support the same slicing operation as lists, this same technique can also copy tuples. Dictionaries, on the other hand, don't support slicing. To make a copy of a dictionary myDict, you can use:

    newDict = {}
    for key in myDict.keys():
        newDict[key] = myDict[key]

This is such a common task that a new method was added to the dictionary object in Python 1.5, the copy() method, which performs this task. So the preceding code can be replaced with the single statement:

    newDict = myDict.copy()

Another common dictionary operation is also now a standard dictionary feature. If you have a dictionary
oneDict, and want to update it with the contents of a different dictionary otherDict, simply type oneDict.update(otherDict). This is the equivalent of:

    for key in otherDict.keys():
        oneDict[key] = otherDict[key]

If oneDict shared some keys with otherDict before the update() operation, the old values associated with the keys in oneDict are obliterated by the update. This may be what you want to do (it usually is, which is why this behavior was chosen and why it was called "update"). If it isn't, the right thing to do might be to complain (raise an exception), as in:

    def mergeWithoutOverlap(oneDict, otherDict):
        newDict = oneDict.copy()
        for key in otherDict.keys():
            if key in oneDict.keys():
                raise ValueError, "the two dictionaries are sharing keys!"
            newDict[key] = otherDict[key]
        return newDict

or, alternatively, combine the values of the two dictionaries, with a tuple, for example:

    def mergeWithOverlap(oneDict, otherDict):
        newDict = oneDict.copy()
        for key in otherDict.keys():
            if key in oneDict.keys():
                newDict[key] = oneDict[key], otherDict[key]
            else:
                newDict[key] = otherDict[key]
        return newDict

To illustrate the differences between the preceding three algorithms, consider the following two dictionaries:

    phoneBook1 = {'michael': '555-1212', 'mark': '554-1121', 'emily': '556-0091'}
    phoneBook2 = {'latoya': '555-1255', 'emily': '667-1234'}

If phoneBook1 is possibly out of date, and phoneBook2 is more up to date but less complete, the right usage is probably phoneBook1.update(phoneBook2). If the two phone books are supposed to have nonoverlapping sets of keys, using newBook = mergeWithoutOverlap(phoneBook1, phoneBook2) lets you know if that assumption is wrong. Finally, if one is a set of home phone numbers and the other a set of office phone numbers, chances are newBook = mergeWithOverlap(phoneBook1, phoneBook2) is what you want, as long as the subsequent code that uses newBook can deal with the fact that newBook['emily'] is the tuple ('556-0091', '667-1234').

Making Copies: The copy Module
Back to making copies: the [:] and .copy() tricks will get you copies in 90% of the cases. If you are writing functions that, in true Python spirit, can deal with arguments of any type, it's sometimes necessary to make copies of X, regardless of what X is. In comes the copy module. It provides two functions, copy and deepcopy. The first is just like the [:] sequence slice operation or the copy method of dictionaries. The second is more subtle and has to do with deeply nested structures (hence the term deepcopy). Take the example of copying a list listOne by slicing it from beginning to end using the [:] construct. This technique makes a new list that contains references to the same objects contained in the original list. If the contents of that original list are immutable objects, such as numbers or strings, the copy is as good as a "true" copy. However, suppose that the first element in listOne is itself a dictionary (or any other mutable object). The first element of the copy of listOne is a new reference to the same dictionary. So if you then modify that dictionary, the modification is evident in both listOne and the copy of listOne. An example makes it much clearer:

    >>> import copy
    >>> listOne = [{"name": "Willie", "city": "Providence, RI"}, 1, "tomato", 3.0]
    >>> listTwo = listOne[:]                 # or listTwo=copy.copy(listOne)
    >>> listThree = copy.deepcopy(listOne)
    >>> listOne.append("kid")
    >>> listOne[0]["city"] = "San Francisco, CA"
    >>> print listOne, listTwo, listThree
    [{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0, 'kid']
    [{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0]
    [{'name': 'Willie', 'city': 'Providence, RI'}, 1, 'tomato', 3.0]

As you can see, modifying listOne directly modified only listOne. Modifying the first entry of the list referenced by listOne led to changes in listTwo, but not in listThree; that's the difference between a shallow copy ([:]) and a deepcopy. The copy module functions know how to copy all the built-in types that are reasonably copyable,[1] including classes and instances.

Sorting and Randomizing
In Chapter 2, you saw that lists have a sort method that does an in-place sort. Sometimes you want to iterate over the sorted contents of a list, without disturbing the contents of this list. Or you may want to list the sorted contents of a tuple. Because tuples are immutable, an operation such as sort, which modifies its target in place, is not allowed. The only solution is to make a list copy of the elements, sort the list copy, and work with the sorted copy, as in:

    listCopy = list(myTuple)
    listCopy.sort()
    for item in listCopy:
        print item                 # or whatever needs doing

This solution is also the way to deal with data structures that have no inherent order, such as dictionaries. One of the reasons that dictionaries are so fast is that the implementation reserves the right to change the order of the keys in the dictionary. It's really not a problem, however, given that you can iterate over the keys of a dictionary using an intermediate copy of the keys of the dictionary:

    keys = myDict.keys()           # returns an unsorted list of
                                   # the keys in the dict
    keys.sort()
    for key in keys:               # print key, value pairs
        print key, myDict[key]     # sorted by key

The
sort method on lists uses the standard Python comparison scheme. Sometimes, however, that scheme isn't what's needed, and you need to sort according to some other procedure. For example, when sorting a list of words, case (lower versus UPPER) may not be significant. The standard comparison of text strings, however, says that all uppercase letters "come before" all lowercase letters, so 'Baby' is "less than" 'apple' but 'baby' is "greater than" 'apple'. In order to do a case-independent sort, you need to define a comparison function that takes two arguments, and returns -1, 0, or 1 depending on whether the first argument is smaller than, equal to, or greater than the second argument. So, for our case-independent sorting, you can use:

    >>> import string
    >>> def caseIndependentSort(something, other):
    ...     something, other = string.lower(something), string.lower(other)
    ...     return cmp(something, other)
    ...
    >>> testList = ['this', 'is', 'A', 'sorted', 'List']
    >>> testList.sort()
    >>> print testList
    ['A', 'List', 'is', 'sorted', 'this']
    >>> testList.sort(caseIndependentSort)
    >>> print testList
    ['A', 'is', 'List', 'sorted', 'this']

We're using the built-in function cmp, which does the hard part of figuring out that 'a' comes before 'b', 'b' before 'c', etc. Our sort function simply lowercases both items and compares the lowercased versions, which is one way of making the comparison case-independent. Also note that the lowercasing conversion is local to the comparison function, so the elements in the list aren't modified by the sort.

Randomizing: The random Module
What about randomizing a sequence, such as a list of lines? The easiest way to randomize a sequence is to repeatedly use the choice function in the random module, which returns a random element from the sequence it receives as an argument.[2] In order to avoid getting the same line multiple times, remember to remove the chosen item. When manipulating a list object, use the remove method:

    import random
    while myList:                        # will stop looping when myList is empty
        element = random.choice(myList)
        myList.remove(element)
        print element,

If you need to randomize a nonlist object, it's usually easiest to convert that object to a list and randomize the list version of the same data, rather than come up with a new strategy for each data type. This might seem a wasteful strategy, given that it involves building intermediate lists that might be quite large. In general, however, what seems large to you probably won't seem so to the computer, thanks to the reference system. Also, consider the time saved by not having to come up with a different strategy for each data type! Python is designed to save time; if that means running a slightly slower or bigger program, so be it. If you're handling enormous amounts of data, it may be worthwhile to optimize. But never optimize until the need for optimization is clear; that would be a waste of time.
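The loop above empties myList as it goes. As a minimal sketch of the convert-to-a-list advice, the following function works on a list copy and hands back the items in random order. It is written for a current Python 3 interpreter, which is an assumption (the chapter's own listings use Python 1.5 syntax), and the name randomized is hypothetical.

```python
import random

def randomized(sequence):
    """Return the items of sequence in random order, leaving it untouched."""
    pool = list(sequence)              # list copy of any sequence (tuple, string, ...)
    result = []
    while pool:                        # stops looping when the pool is empty
        element = random.choice(pool)
        pool.remove(element)           # so the same item is not chosen twice
        result.append(element)
    return result

lines = ("first", "second", "third", "fourth")
shuffled = randomized(lines)
print(shuffled)
```

The original tuple is unchanged afterwards, since only the intermediate list is consumed. In current Python versions, random.shuffle does the same job in place on a list.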
Making New Data Structures
The last point about not reinventing the wheel is especially true when it comes to data structures. For example, Python lists and dictionaries might not be the lists and dictionaries or mappings you're used to, but you should avoid designing your own data structure if these structures will suffice. The algorithms they use have been tested under wide ranges of conditions, and they're fast and stable. Sometimes, however, the interface to these algorithms isn't convenient for a particular task.
For example, computer-science textbooks often describe algorithms in terms of other data structures such as queues and stacks. To use these algorithms, it may make sense to come up with a data structure that has the same methods as these data structures (such as pop and push for stacks or enqueue/dequeue for queues). However, it also makes sense to reuse the built-in list type in the implementation of a stack. In other words, you need something that acts like a stack but is based on a list. The easiest solution is to use a class wrapper around a list. For a minimal stack implementation, you can do this:

    class Stack:
        def __init__(self, data):
            self._data = list(data)
        def push(self, item):
            self._data.append(item)
        def pop(self):
            item = self._data[-1]
            del self._data[-1]
            return item

The following is simple to write, to understand, to read, and to use:

    >>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
    >>> thingsToDo.push('do the dishes')
    >>> print thingsToDo.pop()
    do the dishes
    >>> print thingsToDo.pop()
    wash the kid

Two standard Python naming conventions are used in the Stack class above. The first is that class names start with an uppercase letter, to distinguish them from functions. The other is that the _data attribute starts with an underscore. This is a half-way point between public attributes (which don't start with an underscore), private attributes (which start with two underscores; see Chapter 6, Classes), and Python-reserved identifiers (which both start and end with two underscores). What it means is that _data is an attribute of the class that shouldn't be needed by clients of the class. The class designer expects such "pseudo-private" attributes to be used only by the class methods and by the methods of any eventual subclass.

Making New Lists and Dictionaries: The UserList and UserDict Modules
The Stack class presented earlier does its minimal job just fine. It assumes a fairly minimal definition of what a stack is, specifically, something that supports just two operations, a push and a pop. Quickly, however, you find that some of the features of lists are really nice, such as the ability to iterate over all the elements using the for...in... construct. This can be done by reusing existing code. In this case, you should use the UserList class defined in the UserList module as a class from which the Stack can be derived. The library also includes a UserDict module that is a class wrapper around a dictionary. In general, they are there to be specialized by subclassing. In our case:

    # import the UserList class from the UserList module
    from UserList import UserList

    # subclass the UserList class
    class Stack(UserList):
        push = UserList.append
        def pop(self):
            item = self[-1]            # uses __getitem__
            del self[-1]
            return item

This Stack is a subclass of the UserList class. The UserList class implements the behavior of the [] brackets by defining the special __getitem__ and __delitem__ methods among others, which is why the code in pop works. You don't need to define your own __init__ method because UserList defines a perfectly good default. Finally, the push method is defined just by saying that it's the same as UserList's append method. Now we can do list-like things as well as stack-like things:

    >>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
    >>> print thingsToDo                 # inherited from UserList
    ['write to mom', 'invite friend over', 'wash the kid']
    >>> thingsToDo.pop()
    'wash the kid'
    >>> thingsToDo.push('change the oil')
    >>> for chore in thingsToDo:         # we can also iterate over the contents
    ...     print chore                  # as "for .. in .." uses __getitem__
    ...
    write to mom
    invite friend over
    change the oil

NOTE: As this book was being written, Guido van Rossum announced that in Python 1.5.2 (and subsequent versions), list objects now have an additional method called pop, which behaves just like the one here. It also has an optional argument that specifies what index to use to do the pop (with the default being the last element in the list).

Manipulating Files
Scripting languages were designed in part to help people do repetitive tasks quickly and simply. One of the common things webmasters, system administrators, and programmers need to do is to take a set of files, select a subset of those files, do some sort of manipulation on this subset, and write the output to one or a set of output files. (For example, in each file in a directory, find the last word of every other line that starts with something other than the # character, and print it along with the name of the file.) This is a task for which special-purpose tools have been developed, such as sed and awk. We find that Python does the job just fine using very simple tools.

Doing Something to Each Line in a File
The sys module is most helpful when it comes to dealing with an input file, parsing the text it contains and processing it. Among its attributes are three file objects, called sys.stdin, sys.stdout, and sys.stderr. The names come from the notion of the three streams, called standard in, standard out, and standard error, which are used to connect command-line tools. Standard output (stdout) is where every print statement writes; as a file object, it also has the output methods write and writelines. The other often-used stream is standard in (stdin), which is also a file object, but with the input methods, such as read, readline, and readlines. For example, the following script counts all the lines in the file that is "piped in":

    import sys
    data = sys.stdin.readlines()
    print "Counted", len(data), "lines."

On Unix, you could test it by doing something like:

    % cat countlines.py | python countlines.py
    Counted 3 lines.

On Windows or DOS, you'd do:

    C:\> type countlines.py | python countlines.py
    Counted 3 lines.

The readlines function is useful when implementing simple filter operations. Here are a few examples of such filter operations:
- Finding all lines that start with a #

      import sys
      for line in sys.stdin.readlines():
          if line[0] == '#':
              print line,

  Note that a final comma is needed after the print statement because the line string already includes a newline character as its last character.

- Extracting the fourth column of a file (where columns are defined by whitespace)

      import sys, string
      for line in sys.stdin.readlines():
          words = string.split(line)
          if len(words) >= 4:
              print words[3]

  We look at the length of the words list to find if there are indeed at least four words. The last two lines could also be replaced by the try/except idiom, which is quite common in Python:

          try:
              print words[3]
          except IndexError:           # there aren't enough words
              pass

- Extracting the fourth column of a file, where columns are separated by colons, and lowercasing it

      import sys, string
      for line in sys.stdin.readlines():
          words = string.split(line, ':')
          if len(words) >= 4:
              print string.lower(words[3])

- Printing the first 10 lines, the last 10 lines, and every other line

      import sys, string
      lines = sys.stdin.readlines()
      sys.stdout.writelines(lines[:10])           # first ten lines
      sys.stdout.writelines(lines[-10:])          # last ten lines
      for lineIndex in range(0, len(lines), 2):   # get 0, 2, 4, ...
          sys.stdout.write(lines[lineIndex])      # get the indexed line

- Counting the number of times the word "Python" occurs in a file

      import string
      text = open(fname).read()        # fname holds the name of the file to scan
      print string.count(text, 'Python')

- Changing a list of columns into a list of rows

  In this more complicated example, the task is to "transpose" a file; imagine you have a file that looks like:

      Name:   Willie   Mark   Guido   Mary   Rachel   Ahmed
      Level:  5        4      3       1      6        4
      Tag#:   1234     4451   5515    5124   1881     5132

  And you really want it to look like the following instead:

      Name:   Level:  Tag#:
      Willie  5       1234
      Mark    4       4451
      ...

  You could use code like the following:

      import sys, string
      lines = sys.stdin.readlines()
      wordlists = []
      for line in lines:
          words = string.split(line)
          wordlists.append(words)
      for row in range(len(wordlists[0])):
          for col in range(len(wordlists)):
              print wordlists[col][row] + '\t',
          print

  Of course, you should really use much more defensive programming techniques to deal with the possibility that not all lines have the same number of words in them, that there may be missing data, etc. Those techniques are task-specific and are left as an exercise to the reader.
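One possible take on the defensive version of the transposing example is sketched below. It pads short rows with a filler so that uneven input doesn't raise an IndexError. The function name transpose, the filler parameter, and the decision to pad rather than complain are all assumptions made for illustration, and the code targets a current Python 3 interpreter rather than the Python 1.5 of the listings above.

```python
def transpose(lines, filler="-"):
    """Turn whitespace-separated rows into columns, padding short rows."""
    # Skip blank lines so they don't produce empty word lists.
    wordlists = [line.split() for line in lines if line.strip()]
    if not wordlists:
        return []
    width = max(len(words) for words in wordlists)   # the longest row decides
    rows = []
    for row in range(width):
        cells = []
        for words in wordlists:
            # Use the filler where a row has no word at this position.
            cells.append(words[row] if row < len(words) else filler)
        rows.append("\t".join(cells))
    return rows

sample = [
    "Name: Willie Mark Guido\n",
    "Level: 5 4\n",
    "Tag#: 1234 4451 5515\n",
]
for row in transpose(sample):
    print(row)
```

Whether padding is the right response depends on the task; raising an exception on uneven rows would be just as reasonable when missing data signals a problem upstream.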
Choosing chunk sizes
All the preceding examples assume you can read the entire file at once (that's what the readlines call expects). In some cases, however, that's not possible, for example when processing really huge files on computers with little memory, or when dealing with files that are constantly being appended to (such as log files). In such cases, you can use a while/readline combination, where the file is read a bit at a time, until the end of file is reached. In dealing with files that aren't line-oriented, you may need to read the file a character at a time:

    # read character by character
    while 1:
        next = sys.stdin.read(1)         # read a one-character string
        if not next:                     # or an empty string at EOF
            break
        # ... process character 'next' here ...

Notice that the read() method on file objects returns an empty string at end of file, which breaks out of the while loop. Most often, however, the files you'll deal with consist of line-based data and are processed a line at a time:

    # read line by line
    while 1:
        next = sys.stdin.readline()      # read a one-line string
        if not next:                     # or an empty string at EOF
            break
        # ... process line 'next' here ...

Doing Something to a Set of Files Specified on the Command Line
Being able to read stdin is a great feature; it's the foundation of the Unix toolset. However, one input is not always enough: many tasks need to be performed on sets of files. This is usually done by having the Python program parse the list of arguments sent to the script as command-line options. For example, if you type:

    % python myScript.py input1.txt input2.txt input3.txt output.txt

you might think that myScript.py wants to do something with the first three input files and write a new file, called output.txt. Let's see what the beginning of such a program could look like:

    import sys
    inputfilenames, outputfilename = sys.argv[1:-1], sys.argv[-1]
    for inputfilename in inputfilenames:
        inputfile = open(inputfilename, "r")
        do_something_with_input(inputfile)
    outputfile = open(outputfilename, "w")
    write_results(outputfile)

The second line extracts parts of the argv attribute of the sys module. Recall that it's a list of the words on the command line that called the current program. It starts with the name of the script. So, in the example above, the value of sys.argv is:

    ['myScript.py', 'input1.txt', 'input2.txt', 'input3.txt', 'output.txt']

The script assumes that the command line consists of one or more input files and one output file. So the slicing of the input file names starts at 1 (to skip the name of the script, which isn't an input to the script in most cases), and stops before the last word on the command line, which is the name of the output file. The rest of the script should be pretty easy to understand (but won't work until you provide the do_something_with_input() and write_results() functions).

Note that the preceding script doesn't actually read in the data from the files, but passes the file object down to a function to do the real work. Such a function often uses the readlines() method on file objects, which returns a list of the lines in that file. A generic version of do_something_with_input() is:

    def do_something_with_input(inputfile):
        for line in inputfile.readlines():
            process(line)

Processing Each Line of One or More Files: The fileinput Module

The combination of this idiom with the preceding one regarding opening each file in the
sys.argv[1:] list is so common that Python 1.5 introduced a new module that's designed to help do just this task. It's called fileinput and works like this:

    import fileinput
    for line in fileinput.input():
        process(line)

The fileinput.input() call parses the arguments on the command line, and if there are no arguments to the script, uses sys.stdin instead. The module also provides a bunch of useful functions that let you know which file and line number you're currently manipulating:

    import fileinput, sys, string
    # take the first argument out of sys.argv and assign it to searchterm
    searchterm, sys.argv[1:] = sys.argv[1], sys.argv[2:]
    for line in fileinput.input():
        num_matches = string.count(line, searchterm)
        if num_matches:                     # a nonzero count means there was a match
            print "found '%s' %d times in %s on line %d." % (searchterm, num_matches,
                fileinput.filename(), fileinput.filelineno())

If this script were called mygrep.py, it could be used as follows:

    % python mygrep.py in *.py
    found 'in' 2 times in countlines.py on line 2.
    found 'in' 2 times in countlines.py on line 3.
    found 'in' 2 times in mygrep.py on line 1.
    found 'in' 4 times in mygrep.py on line 4.
    found 'in' 2 times in mygrep.py on line 5.
    found 'in' 2 times in mygrep.py on line 7.
    found 'in' 3 times in mygrep.py on line 8.
    found 'in' 3 times in mygrep.py on line 12.

Filenames and Directories
We have now covered reading existing files, and if you remember the discussion on the open built-in function in Chapter 2, you know how to create new files. There are a lot of tasks, however, that need different kinds of file manipulations, such as directory and path management and removing files. Your two best friends in such cases are the os and os.path modules described in Chapter 8, Built-in Tools.

Let's take a typical example: you have lots of files, all of which have a space in their name, and you'd like to replace the spaces with underscores. All you really need is the os.curdir attribute (which holds an operating-system specific string that corresponds to the current directory), the os.listdir function (which returns the list of filenames in a specified directory), and the os.rename function:

    import os, sys, string
    if len(sys.argv) == 1:                     # if no filenames are specified,
        filenames = os.listdir(os.curdir)      # use current dir
    else:                                      # otherwise, use files specified
        filenames = sys.argv[1:]               # on the command line
    for filename in filenames:
        if ' ' in filename:
            newfilename = string.replace(filename, ' ', '_')
            print "Renaming", filename, "to", newfilename, "..."
            os.rename(filename, newfilename)

This program works fine, but it reveals a certain Unix-centrism. That is, if you call it with wildcards, such as:

    python despacify.py *.txt

you find that on Unix machines, it renames all the files whose names contain spaces and end with .txt. In a DOS-style shell, however, this won't work because the shell normally used in DOS and Windows doesn't convert from *.txt to the list of filenames; it expects the program to do it. This is called globbing, because the * is said to match a glob of characters.

Matching Sets of Files: The glob Module
The glob module exports a single function, also called glob, which takes a filename pattern and returns a list of all the filenames that match that pattern (in the current working directory):

    import sys, glob, operator
    print sys.argv[1:]
    sys.argv = reduce(operator.add, map(glob.glob, sys.argv))
    print sys.argv[1:]

Running this on Unix and DOS shows that on Unix, the Python glob didn't do anything because the globbing was done by the Unix shell before Python was invoked, and on DOS, Python's globbing came up with the same answer:

    /usr/python/book$ python showglob.py *.py
    ['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']
    ['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']

    C:\python\book> python showglob.py *.py
    ['*.py']
    ['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']

This script isn't trivial, though, because it uses two conceptually difficult operations: a map followed by a reduce. map was mentioned in Chapter 4, Functions, but reduce is new to you at this point (unless you have background in LISP-type languages). map is a function that takes a callable object (usually a function) and a sequence, calls the callable object with each element of the sequence in turn, and returns a list containing the values returned by the function. For a graphical representation of what map does, see Figure 9-1. [3]
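To watch map call glob.glob once per pattern, here is a small sketch that builds its own scratch directory so the result doesn't depend on the files you happen to have. It assumes a current Python 3 interpreter, where map returns an iterator rather than a list, and the filenames a.py, b.py, and notes.txt are made up for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create three empty files to match against.
    for name in ("a.py", "b.py", "notes.txt"):
        open(os.path.join(folder, name), "w").close()

    patterns = [os.path.join(folder, "*.py"), os.path.join(folder, "*.txt")]

    # map calls glob.glob with each pattern in turn; every call returns a list,
    # so the overall result is a list of lists (sorted here for a stable order).
    matches = [sorted(found) for found in map(glob.glob, patterns)]
    names = [[os.path.basename(path) for path in found] for found in matches]

print(names)
```

Each pattern contributes its own inner list, which is why a further step is needed to combine them into one flat list of filenames.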
Figure 9-1. Graphical representation of the behavior of the map built-in
map is needed here (or something equivalent) because you don't know how many arguments were entered on the command line (e.g., it could have been *.py *.txt *.doc). So the glob.glob function is called with each argument in turn. Each glob.glob call returns a list of filenames that match the pattern. The map operation then returns a list of lists, which you need to convert to a single list--the combination of all the lists in this list of lists. That means doing list1 + list2 + ... + listN. That's exactly the kind of situation where the reduce function comes in handy.

Just as with map, reduce takes a function as its first argument and applies it to the first two elements of the sequence it receives as its second argument. It then takes the result of that call and calls the function again with that result and the next element in the sequence, etc. (See Figure 9-2 for an illustration of reduce.) But wait: you need + applied to a set of things, and + doesn't look like a function (it isn't). So a function is needed that works the same as +. Here's one:

    def myAdd(something, other):
        return something + other

You would then use reduce(myAdd, map(...)). This works fine, but better yet, you can use the add function defined in the operator module, which does the same thing. The operator module defines functions for every syntactic operation in Python (including attribute-getting and slicing), and you should use those instead of homemade ones for two reasons. First, they've been coded, debugged, and tested by Guido, who has a pretty good track record at writing bug-free code. Second, they're actually C functions, and applying reduce (or map, or filter) to C functions results in much faster performance than applying it to Python functions. This clearly doesn't matter when all you're doing is going through a few hundred files once. If you do thousands of globs all the time, however, speed can become an issue, and now you know how to do it quickly.
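The equivalence of a homemade adding function and operator.add can be checked directly. The sketch below assumes a current Python 3 interpreter, where reduce has moved to the functools module; the list of lists is sample data standing in for the output of the map step, and my_add is a stand-in for the homemade function.

```python
from functools import reduce    # in Python 3, reduce lives in functools
import operator

# Sample data: what map(glob.glob, patterns) might have produced.
list_of_lists = [["countlines.py", "mygrep.py"], ["notes.txt"], []]

def my_add(something, other):
    """Homemade equivalent of the + operator."""
    return something + other

flat_homemade = reduce(my_add, list_of_lists)
flat_operator = reduce(operator.add, list_of_lists)

# Passing [] as a starting value keeps reduce from failing on an empty list.
flat_empty = reduce(operator.add, [], [])

print(flat_operator)
```

Both calls build the same flat list; the third shows the optional initial value, which is worth supplying whenever no patterns might be given at all.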
Figure 9-2. Graphical representation of the behavior of the reduce built-in
The filter built-in function, like map and reduce, takes a function and a sequence as arguments. It returns the subset of the elements in the sequence for which the specified function returns something that's true. To find all of the even numbers in a set, type this:

    >>> numbers = range(30)
    >>> def even(x):
    ...     return x % 2 == 0
    ...
    >>> print numbers
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
    >>> print filter(even, numbers)
    [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

Or, if you wanted to find all the words in a file that are at least 10 characters long, you could use:

    import string
    words = string.split(open('myfile.txt').read())     # get all the words
    def at_least_ten(word):
        return len(word) >= 10
    longwords = filter(at_least_ten, words)

For a graphical representation of what filter does, see Figure 9-3. One nice special feature of filter is that if one passes None as the first argument, it filters out all false entries in the sequence. So, to find all the nonempty lines in a file called myfile.txt, do this (stripping each line first, because readlines leaves the newline character on every line, so even a blank line is not an empty string):

    import string
    lines = map(string.strip, open('myfile.txt').readlines())
    lines = filter(None, lines)            # remember, the empty string is false
map, filter, and reduce are three powerful constructs, and they're worth knowing about; however, they are never necessary. It's fairly simple to write a Python function that does the same thing as any of them. The built-in versions are usually faster, though, especially when operating on built-in functions written in C, such as the functions in the operator module.
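To back up the claim that none of the three is necessary, here is one way the plain-loop equivalents might look. The names my_map, my_filter, and my_reduce are hypothetical, the functions cover only the single-sequence case, and the code assumes a current Python 3 interpreter.

```python
def my_map(function, sequence):
    """Call function on each item and collect the results in a list."""
    result = []
    for item in sequence:
        result.append(function(item))
    return result

def my_filter(function, sequence):
    """Keep the items for which function returns something true.

    Passing None as the function keeps the items that are themselves true.
    """
    result = []
    for item in sequence:
        keep = item if function is None else function(item)
        if keep:
            result.append(item)
    return result

def my_reduce(function, sequence):
    """Combine the items pairwise, left to right, into a single value."""
    items = iter(sequence)
    try:
        result = next(items)             # start with the first element
    except StopIteration:
        raise TypeError("my_reduce() of empty sequence") from None
    for item in items:
        result = function(result, item)
    return result

print(my_map(len, ["map", "filter", "reduce"]))
print(my_filter(None, ["", "x", 0, 3]))
print(my_reduce(lambda a, b: a + b, [[1], [2, 3], [4]]))
```

Writing them out also makes the difference between the three easy to see: map keeps the length of the sequence, filter can only shorten it, and reduce collapses it to one value.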
Figure 9-3. Graphical representation of the behavior of the filter built-in
Using Temporary Files
If you've ever written a shell script and needed to use intermediary files for storing the results of some intermediate stages of processing, you probably suffered from directory litter. You started out with 20 files called log_001.txt, log_002.txt etc., and all you wanted was one summary file called log_sum.txt. In addition, you had a whole bunch of log_001.tmp, log_001.tm2, etc. files that, while they were labeled temporary, stuck around. At least that's what we've seen happen in our own lives. To put order back into your directories, use temporary files in specific directories and clean them up afterwards.
To help in this temporary file-management problem, Python provides a nice little module called tempfile that publishes two functions: mktemp() and TemporaryFile(). The former returns the name of a file not currently in use in a directory on your computer reserved for temporary files (such as /tmp on Unix or C:\TMP on Windows). The latter returns a new file object directly. For example:

    # read input file
    inputFile = open('input.txt', 'r')

    import tempfile
    # create temporary file
    tempFile = tempfile.TemporaryFile()                   # we don't even need to
    first_process(input = inputFile, output = tempFile)   # know the filename...

    # create final output file
    outputFile = open('output.txt', 'w')
    second_process(input = tempFile, output = outputFile)
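A runnable miniature of the same two-stage idea is sketched here for a current Python 3 interpreter (an assumption). The bodies of first_process and second_process are invented stand-ins for whatever processing the stages would really do; the one detail worth noticing is the seek(0) call, since the second stage has to rewind the temporary file before it can read what the first stage wrote.

```python
import tempfile

def first_process(output):
    """Stand-in first stage: write an intermediate result."""
    output.write("intermediate result\n")

def second_process(source):
    """Stand-in second stage: read the intermediate result back."""
    source.seek(0)                   # rewind to the start before reading
    return source.read().upper()

# mode="w+" opens the temporary file for both writing and reading as text.
with tempfile.TemporaryFile(mode="w+") as temp_file:
    first_process(temp_file)
    summary = second_process(temp_file)

print(summary, end="")
```

Leaving the with block closes the file object, and the temporary file goes away with it, so nothing is left behind in the directory.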
Using tempfile.TemporaryFile() works well in cases where the intermediate steps manipulate file objects. One of its nice features is that when the file object is deleted, it automatically deletes the file it created on disk, thus cleaning up after itself. One important use of temporary files, however, is in conjunction with the os.system call, which means using a shell, hence using filenames, not file objects. For example, let's look at a program that creates form letters and mails them to a list of email addresses (on Unix only):

    import os, tempfile

    formletter = """Dear %s,\nI'm writing to you to suggest that ..."""    # etc.
    myDatabase = [('Bill Clinton', 'bill@whitehouse.gov.us'),
                  ('Bill Gates', 'bill@microsoft.com'),
                  ('Bob', 'bob@subgenius.org')]
    for name, email in myDatabase:
        specificLetter = formletter % name
        tempfilename = tempfile.mktemp()
        tempFile = open(tempfilename, 'w')     # don't reuse the module's name
        tempFile.write(specificLetter)
        tempFile.close()
        os.system('/usr/bin/mail %(email)s -s "Urgent!" < %(tempfilename)s' % vars())
        os.remove(tempfilename)

The first line in the for loop returns a customized version of the form letter based on the name it's given. That text is then written to a temporary file that's emailed to the appropriate email address using the os.system call (which we'll cover later in this chapter). Finally, to clean up, the temporary file is removed. If you forgot how the % bit works, go back to Chapter 2 and review it; it's worth knowing.

The vars() function is a built-in function that returns a dictionary corresponding to the variables defined in the current local namespace. The keys of the dictionary are the variable names, and the values of the dictionary are the variable values. vars() comes in quite handy for exploring namespaces. It can also be called with an object as an argument (such as a module, a class, or an instance), and it will return the namespace of that object. Two other built-ins, locals() and globals(), return the local and global namespaces, respectively. In all three cases, modifying the returned dictionaries doesn't guarantee any effect on the namespace in question, so view these as read-only and you won't be surprised. You can see that the vars() call creates a dictionary that is used by the string interpolation mechanism; it's thus important that the names inside the %(...)s bits in the string match the variable names in the program.

More on Scanning Text Files
Suppose you've run a program that stores its output in a text file, which you need to load. The program creates a file that's composed of a series of lines that each contain a value and a key separated by whitespace:
    value key
    value key
    value key
    and so on...

A key can appear on more than one line in the file, and you'd probably like to collect all the values that appear for each given key as you scan the file. Here's one way to solve this problem:
    #!/usr/bin/env python
    import sys, string

    entries = {}
    for line in open(sys.argv[1], 'r').readlines():
        left, right = string.split(line)
        try:
            entries[right].append(left)        # extend list
        except KeyError:
            entries[right] = [left]            # first time seen

    for (right, lefts) in entries.items():
        print "%04d '%s'\titems => %s" % (len(lefts), right, lefts)

This script uses the readlines method to scan the text file line by line, and calls the built-in string.split function to chop the line into a list of substrings--a list containing the value and key strings separated by blanks or tabs in the file. To store all occurrences of a key, the script uses a dictionary called entries. The try statement in the loop tries to add new values to an existing entry for a key; if no entry exists for the key, it creates one. Notice that the try could be replaced with an if here:

    if entries.has_key(right):        # is it already in the dictionary?
        entries[right].append(left)   # add to the list of current values for key
    else:
        entries[right] = [left]       # initialize key's values list

Testing whether a dictionary contains a key is sometimes faster than catching an exception with the try technique; it depends on how many times the test is true. Here's an example of this script in action. The input filename is passed in as a command-line argument (sys.argv[1]):

    % cat data.txt
    1       one
    2       one
    3       two
    7       three
    8       two
    10      one
    14      three
    19      three
    20      three
    30      three

    % python collector1.py data.txt
    0003 'one'      items => ['1', '2', '10']
    0005 'three'    items => ['7', '14', '19', '20', '30']
    0002 'two'      items => ['3', '8']

You can make this code more useful by packaging the scanner logic in a function that returns the entries dictionary as a result and wrapping the printing loop logic at the bottom in an if test:

    #!/usr/bin/env python
    import sys, string

    def collect(file):
        entries = {}
        for line in file.readlines():
            left, right = string.split(line)
            try:
                entries[right].append(left)        # extend list
            except KeyError:
                entries[right] = [left]            # first time seen
        return entries

    if __name__ == "__main__":                     # when run as a script
        if len(sys.argv) == 1:
            result = collect(sys.stdin)            # read from stdin stream
        else:
            result = collect(open(sys.argv[1], 'r'))   # read from passed filename
        for (right, lefts) in result.items():
            print "%04d '%s'\titems => %s" % (len(lefts), right, lefts)

This way, the program becomes a bit more flexible. By using the if __name__ == "__main__" trick, you can still run it as a top-level script (and get a display of the results), or import the function it defines and process the resulting dictionary explicitly:

    # run as a script file
    % collector2.py < data.txt
    result displayed here...

    # use in some other component (or interactively)
    from collector2 import collect
    result = collect(open("spam.txt", "r"))
    process result here...

Since the collect function accepts an open file object, it also works on any object that provides the methods (i.e., interface) built-in files do. For example, if you want to read text from a simple string, wrap it in a class that implements the required interface and pass an instance of the class to the collect function:

    >>> from collector2 import collect
    >>> from StringIO import StringIO
    >>>
    >>> str = StringIO("1 one\n2 one\n3 two")
    >>> result = collect(str)                # scans the wrapped string
    >>> print result                         # {'one':['1','2'],'two':['3']}

This code uses the StringIO class in the standard Python library to wrap the string into an instance that has all the methods file objects have; see the Library Reference for more details on StringIO. You could also write a different class or subclass from StringIO if you need to modify its behavior. Regardless, the collect function happily reads text from the string str, which happens to be an in-memory object, not a file.

The main reason all this works is that the collect function was designed to avoid making assumptions about the type of object its file parameter references. As long as the object exports a readlines method that returns a list of strings, collect doesn't care what type of object it processes. The interface is all that matters. This runtime binding[4] is an important feature of Python's object system, and allows you to easily write component programs that communicate with other components. For instance, consider a program that reads and writes satellite telemetry data using the standard file interface. By plugging in an object with the right sort of interface, you can redirect its streams to live sockets, GUI boxes, web interfaces, or databases without changing the program itself or even recompiling it.

Manipulating Programs
Calling Other Programs
Python can be used like a shell scripting language, to steer other tools by calling them with arguments the Python program determines at runtime. So, if you have to run a specific program (call it analyzeData) with various data files and various parameters specified on the command line, you can use the os.system() call, which takes a string specifying a command to run in a subshell. Specifically:

    import os

    for datafname in ['data.001', 'data.002', 'data.003']:
        for parameter1 in range(1, 10):
            os.system("analyzeData -in %(datafname)s -param1 %(parameter1)d" % vars())

If analyzeData is a Python program, you're better off doing it without invoking a subshell; simply use the import statement up front and a function call in the loop. Not every useful program out there is a Python program, though.

In the preceding example, the output of analyzeData is most likely either a file or standard out. If it's standard out, it would be nice to be able to capture its output. The popen() function call is an almost standard way to do this. We'll show it off in a real-world task.

When we were writing this book, we were asked to avoid using tabs in source-code listings and use spaces instead. Tabs can wreak havoc with typesetting, and since indentation matters in Python, incorrect typesetting has the potential to break examples. But since old habits die hard (at least one of us uses tabs to indent his own Python code), we wanted a tool to find any tabs that may have crept into our code before it was shipped off for publication. The following script, findtabs.py, does the trick:

    #!/usr/bin/env python
    # find files, search for tabs

    import string, os
    cmd = 'find . -name "*.py" -print'         # find is a standard Unix tool

    for file in os.popen(cmd).readlines():     # run find command
        num  = 1
        name = file[:-1]                       # strip '\n'
        for line in open(name).readlines():    # scan the file
            pos = string.find(line, "\t")
            if pos >= 0:
                print name, num, pos           # report tab found
                print '....', line[:-1]        # [:-1] strips final \n
                print '....', ' '*pos + '*', '\n'
            num = num+1

This script uses two nested for loops. The outer loop uses os.popen to run a find shell command, which returns a list of all the Python source filenames accessible in the current directory and its subdirectories. The inner loop reads each line in the current file, using string.find to look for tabs. But the real magic in this script is in the built-in tools it employs:

os.popen
    Takes a shell command passed in as a string (called cmd in the example) and returns a file-like object connected to the command's standard input or output streams. Output is the default if you don't pass an explicit "r" or "w" mode argument. By reading the file-like object, you can intercept the command's output as we did here--the result of the find. It turns out that there's a module in the standard library called find.py that provides a function that does a very similar thing to our use of popen with the find Unix command. As an exercise, you could rewrite findtabs.py to use it instead.

string.find
    Returns the index of the first occurrence of one string in another, searching from left to right. In the script, we use it to look for a tab, passed in as an (escaped) one-character string ('\t').

When a tab is found, the script prints the matching line, along with a pointer to where the tab occurs. Notice the use of string repetition: the expression ' '*pos moves the print cursor to the right, up to the index of the first tab. Notice also that cmd uses double quotes inside a single-quoted string, so no backslash escapes are needed. Here is the script at work, catching illegal tabs in the unfortunately named file happyfingers.py:

    C:\python\book-examples> python findtabs.py
    ./happyfingers.py 2 0
    ....     for i in range(10):
    .... *

    ./happyfingers.py 3 0
    ....     print "oops..."
    .... *

    ./happyfingers.py 5 5
    ....      print "bad style"
    ....      *

A note on portability: the find shell command used in the findtabs script is a Unix command, which may or may not be available on other platforms (it ran under Windows in the listing above because a find utility program was installed). os.popen functionality is available as win32pipe.popen in the win32 extensions to Python for Windows.[5] If you want to write code that catches shell command output portably, use something like the following code early in your script:

    import sys
    if sys.platform == "win32":                # on a Windows port
        try:
            import win32pipe
            popen = win32pipe.popen
        except ImportError:
            raise ImportError, "The win32pipe module could not be found"
    else:                                      # else on POSIX box
        import os
        popen = os.popen
    ...And use popen in blissful platform ignorance

The sys.platform attribute is always preset to a string that identifies the underlying platform (and hence the Python port you're using). Although the Python language isn't platform-dependent, some of its libraries may be; checking sys.platform is the standard way to handle cases where they are. Notice the nested import statements here; as we've seen, import is just an executable statement that assigns a variable name.

Internet-Related Activities
The Internet is a treasure trove of information, but its exponential growth can make it hard to manage. Furthermore, most tools currently available for "surfing the Web" are not programmable. Many web-related tasks can be automated quite simply with the tools in the standard Python distribution.
Downloading a Web Page Programmatically
If you're interested in finding out what the weather in a given location is over a period of months, it's much easier to set up an automated program to get the information and collect it in a file than to have to remember to do it by hand.
Here is a program that finds the weather in a couple of cities and states using the pages of the weather.com web site:
    import urllib, urlparse, string, time

    def get_temperature(country, state, city):
        url = urlparse.urljoin('http://www.weather.com/weather/cities/',
                               string.lower(country)+'_' + \
                               string.lower(state) + '_' + \
                               string.replace(string.lower(city), ' ',
                                              '_') + '.html')
        data = urllib.urlopen(url).read()
        start = string.index(data, 'current temp: ') + len('current temp: ')
        stop = string.index(data, '°v;F', start-1)
        temp = int(data[start:stop])
        localtime = time.asctime(time.localtime(time.time()))
        print ("On %(localtime)s, the temperature in %(city)s, " +\
               "%(state)s %(country)s is %(temp)s F.") % vars()

    get_temperature('FR', '', 'Paris')
    get_temperature('US', 'RI', 'Providence')
    get_temperature('US', 'CA', 'San Francisco')

When run, it produces output like:
    ~/book:> python get_temperature.py
    On Wed Nov 25 16:22:25 1998, the temperature in Paris,  FR is 39 F.
    On Wed Nov 25 16:22:30 1998, the temperature in Providence, RI US is 39 F.
    On Wed Nov 25 16:22:35 1998, the temperature in San Francisco, CA US is 58 F.

The code in get_temperature.py suffers from one flaw, which is that the logic of the URL creation and of the temperature extraction is dependent on the specific HTML produced by the web site you use. The day the site's graphic designer decides that "current temp:" should be spelled with capitalized words, this script won't work. This is a problem with programmatic parsing of web pages that will go away only when more structural formats (such as XML) are used to produce web pages.[6]
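The fragility described above can be illustrated without touching the network. The sketch below, in current Python 3 syntax (an assumption), applies the same index-based extraction to two hypothetical HTML fragments; the function name extract_temperature, the default unit marker, and both sample pages are made up for illustration and are not taken from any real site.

```python
def extract_temperature(page, label="current temp: ", unit="F"):
    """Return the integer found between label and unit, or None if absent."""
    try:
        start = page.index(label) + len(label)
        stop = page.index(unit, start)
        return int(page[start:stop])
    except ValueError:           # marker missing, or the text is not a number
        return None

# Hypothetical page fragments: the second differs only in capitalization
old_page = "<b>current temp: 39F</b>"
new_page = "<b>Current Temp: 39F</b>"
print(extract_temperature(old_page))
print(extract_temperature(new_page))
```

Returning None instead of letting the exception escape at least lets a long-running collector log the failure and carry on; it does nothing to make the parsing itself more robust.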
Checking the Validity of Links and Mirroring Web Sites: webchecker.py and Friends

One of the big hassles of maintaining a web site is that as the number of links in the site increases, so does the chance that some of the links will no longer be valid. Good web-site maintenance therefore includes periodic checking for such stale links. The standard Python distribution includes a tool that does just this. It lives in the Tools/webchecker directory and is called webchecker.py.
A companion program called websucker.py located in the same directory uses similar logic to create a local copy of a remote web site. Be careful when trying it out, because if you're not careful, it will try to download the entire Web on your machine! The same directory includes two programs called wsgui.py and webgui.py that are Tkinter-based frontends to websucker and webchecker, respectively. We encourage you to look at the source code for these programs to see how one can build sophisticated web-management systems with Python's standard toolset.
In the Tools/Scripts directory, you'll find many other small to medium-sized scripts that might be of interest, such as an equivalent of websucker.py for FTP servers called ftpmirror.py.
Checking Mail
Electronic mail is probably the most important medium on the Internet today; it's certainly the protocol with which most information passes between individuals. Python includes several libraries for processing mail. The one you'll need to use depends on the kind of mail server you're using. Modules for interacting with POP3 servers (poplib) and IMAP servers (imaplib) are included. If you need to talk to a Microsoft Exchange server, you'll need some of the tools in the win32 distribution (see Appendix B, Platform-Specific Topics, for pointers to the win32 extensions web page).

Here's a simple test of the poplib module, which is used to talk to a mail server running the POP protocol:

    >>> from poplib import *
    >>> server = POP3('mailserver.spam.org')
    >>> print server.getwelcome()
    +OK QUALCOMM Pop server derived from UCB (version 2.1.4-R3) at spam starting.
    >>> server.user('da')
    '+OK Password required for da.'
    >>> server.pass_('youllneverguess')
    '+OK da has 153 message(s) (458167 octets).'
    >>> header, msg, octets = server.retr(152)    # let's get the latest msgs
    >>> import string
    >>> print string.join(msg[:3], '\n')          # and look at the first three lines
    Return-Path: <jim@bigbad.com>
    Received: from gator.bigbad.com by mailserver.spam.org (4.1/SMI-4.1)
            id AA29605; Wed, 25 Nov 98 15:59:24 PST

In a real application, you'd use a specialized module such as rfc822 to parse the header lines, and perhaps the mimetools and mimify modules to get the data out of the message body (e.g., to process attached files).

Bigger Examples
Compounding Your Interest
Someday, most of us hope to put a little money away in a savings account (assuming those student loans ever go away). Banks hope you do too, so much so that they'll pay you for the privilege of holding onto your money. In a typical savings account, your bank pays you interest on your principal. Moreover, they keep adding the percentage they pay you back to your total, so that your balance grows a little bit each year. The upshot is that you need to project on a year-by-year basis if you want to track the growth in your savings. This program, interest.py, is an easy way to do it in Python:
    trace = 1                                  # print each year?

    def calc(principal, interest, years):
        for y in range(years):
            principal = principal * (1.00 + (interest / 100.0))
            if trace: print y+1, '=> %.2f' % principal
        return principal

This function just loops through the number of years you pass in, accumulating the principal (your initial deposit plus all the interest added so far) for each year. It assumes that you'll avoid the temptation to withdraw money. Now, suppose we have $65,000 to invest in a 5.5% interest yield account, and want to track how the principal will grow over 10 years. We import and call our compounding function passing in a starting principal, an interest rate, and the number of years we want to project:
    % python
    >>> from interest import calc
    >>> calc(65000, 5.5, 10)
    1 => 68575.00
    2 => 72346.63
    3 => 76325.69
    4 => 80523.60
    5 => 84952.40
    6 => 89624.78
    7 => 94554.15
    8 => 99754.62
    9 => 105241.13
    10 => 111029.39
    111029.389793

and we wind up with $111,029. If we just want to see the final balance, we can set the trace global (module-level) variable in interest to 0 before we call the calc function:

    >>> import interest
    >>> interest.trace = 0
    >>> calc(65000, 5.5, 10)
    111029.389793

Naturally, there are many ways to calculate compound interest. For example, the variation of the interest calculator function below adds to the principal explicitly, and prints both the interest earned (earnings) and current balance (principal) as it steps through the years:

    def calc(principal, interest, years):
        interest = interest / 100.0
        for y in range(years):
            earnings  = principal * interest
            principal = principal + earnings
            if trace: print y+1, '(+%d)' % earnings, '=> %.2f' % principal
        return principal

We get the same results with this version, but more information:

    >>> interest.trace = 1
    >>> calc(65000, 5.5, 10)
    1 (+3575) => 68575.00
    2 (+3771) => 72346.63
    3 (+3979) => 76325.69
    4 (+4197) => 80523.60
    5 (+4428) => 84952.40
    6 (+4672) => 89624.78
    7 (+4929) => 94554.15
    8 (+5200) => 99754.62
    9 (+5486) => 105241.13
    10 (+5788) => 111029.39
    111029.389793

The last comment on this script is that it may not give you exactly the same numbers as your bank. Bank programs tend to round everything off to the cent on a regular basis. Our program rounds off the numbers to the cent when printing the results (that's what the %.2f does; see Chapter 2 for details), but keeps the full precision afforded by the computer in its intermediate computation (as shown in the last line).

An Automated Dial-Out Script
Once upon a time, a certain book's coauthor worked at a company without an Internet feed. The system support staff did, however, install a dial-out modem on site, so anyone with a personal Internet account and a little Unix savvy could connect to a shell account and do all their Internet business at work. Dialing out meant using the Kermit file transfer utility.
One drawback with the modem setup was that people wanting to dial out had to keep trying each of 10 possible modems until one was free (dial on one; if it's busy, try another, and so on). Since modems were addressable under Unix using the filename pattern /dev/modem*, and modem locks via /var/spool/locks/LCK*modem*, a simple Python script was enough to check for free modems automatically. The following program, dokermit, uses a list of integers to keep track of which modems are locked, glob.glob to do filename expansion, and os.system to run a kermit command when a free modem has been found:

    #!/usr/bin/env python
    # find a free modem to dial out on

    import glob, os, string
    LOCKS = "/var/spool/locks/"

    locked = [0] * 10
    for lockname in glob.glob(LOCKS + "LCK*modem*"):    # find locked modems
        print "Found lock:", lockname
        locked[string.atoi(lockname[-1])] = 1           # 0..9 at end of name

    print 'free: ',
    for i in range(10):                                 # report, dial-out
        if not locked[i]: print i,
    print

    for i in range(10):
        if not locked[i]:
            if raw_input("Try %d? " % i) == 'y':
                os.system("kermit -m hayes -l /dev/modem%d -b 19200 -S" % i)
                if raw_input("More? ") != 'y': break

By convention, modem lock files have the modem number at the end of their names; we use this hook to build a modem device name in the Kermit command. Notice that this script keeps a list of 10 integer flags to mark which modems are free (1 means locked). The program above works only if there are 10 or fewer modems; if there are more, you'd need to use larger lists and loops, and parse the lock filename, not just look at its last character.
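To illustrate that last point, here is a minimal sketch, in current Python 3 syntax (an assumption), of parsing the whole trailing number out of a lock filename so that more than 10 modems can be handled. The helper names, the regular expression, and the sample lock names (such as LCK..modem12) are hypothetical; the only thing taken from the text is the convention that the modem number sits at the end of the name.

```python
import re

LOCK_PATTERN = re.compile(r"modem(\d+)$")   # all digits at the end of the name

def locked_modems(locknames):
    """Return the set of modem numbers named in the lock filenames."""
    locked = set()
    for lockname in locknames:
        match = LOCK_PATTERN.search(lockname)
        if match:                            # ignore names that don't fit
            locked.add(int(match.group(1)))
    return locked

def free_modems(locknames, count):
    """Return the sorted modem numbers in range(count) that are not locked."""
    return sorted(set(range(count)) - locked_modems(locknames))

# Hypothetical lock files, as glob.glob might return them
names = ["/var/spool/locks/LCK..modem3", "/var/spool/locks/LCK..modem12"]
print(free_modems(names, 16))
```

A set replaces the fixed-size list of flags here, so the number of modems only has to be known when asking which ones are free.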
An Interactive Rolodex
While most of the preceding examples use lists as the primary data structures, dictionaries are in many ways more powerful and fun to use. Their presence as a built-in data type is part of what makes Python high level, which basically means "easy to use for complex tasks." Complementing this rich set of built-in data types is an extensive standard library. One powerful module in this library is the cmd module, which provides a class Cmd you can subclass to make a simple command-line interpreter. The following example is fairly large, but it's really not that complicated, and illustrates well the power of dictionaries and of reuse of standard modules.

The task at hand is to keep track of names and phone numbers and allow the user to manipulate this list using an interactive interface, with error checking and user-friendly features such as online help. The following example shows the kind of interaction our program allows:
    % python rolo.py
    Monty's Friends: help

    Documented commands (type help <topic>):
    ========================================
    EOF  add  find  list  load  save

    Undocumented commands:
    ======================
    help

We can get help on specific commands:

    Monty's Friends: help find                 # compare with the help_find() method
    Find an entry (specify a name)

We can manipulate the entries of the Rolodex easily enough:

    Monty's Friends: add larry                 # we can add entries
    Enter Phone Number for larry: 555-1216
    Monty's Friends: add                       # if the name is not specified...
    Enter Name: tom                            # ...the program will ask for it
    Enter Phone Number for tom: 555-1000
    Monty's Friends: list
    =========================================
                   larry : 555-1216
                     tom : 555-1000
    =========================================
    Monty's Friends: find larry
    The number for larry is 555-1216.
    Monty's Friends: save myNames              # save our work
    Monty's Friends: ^D                        # quit the program (^Z on Windows)

And the nice thing is, when we restart this program, we can recover the saved data:
    % python rolo.py                           # restart
    Monty's Friends: list                      # by default, there is no one listed
    Monty's Friends: load myNames              # it only takes this to reload the dir
    Monty's Friends: list
    =========================================
                   larry : 555-1216
                     tom : 555-1000
    =========================================

Most of the interactive interpreter functionality is provided by the Cmd class in the cmd module, which just needs customization to work. Specifically, you need to set the prompt attribute and add some methods that start with do_ and help_. The do_ methods must take a single argument, and the part after the do_ is the name of the command. Once you call the cmdloop() method, the Cmd class does the rest. Read the following code, rolo.py, one method at a time and compare the methods with the previous output:

    #!/usr/bin/env python
    # An interactive rolodex

    import string, sys, pickle, cmd

    class Rolodex(cmd.Cmd):

        def __init__(self):
            cmd.Cmd.__init__(self)              # initialize the base class
            self.prompt = "Monty's Friends: "   # customize the prompt
            self.people = {}                    # at first, we know nobody

        def help_add(self):
            print "Adds an entry (specify a name)"
        def do_add(self, name):
            if name == "": name = raw_input("Enter Name: ")
            phone = raw_input("Enter Phone Number for "+ name+": ")
            self.people[name] = phone           # add phone number for name

        def help_find(self):
            print "Find an entry (specify a name)"
        def do_find(self, name):
            if name == "": name = raw_input("Enter Name: ")
            if self.people.has_key(name):
                print "The number for %s is %s." % (name, self.people[name])
            else:
                print "We have no record for %s." % (name,)

        def help_list(self):
            print "Prints the contents of the directory"
        def do_list(self, line):
            names = self.people.keys()          # the keys are the names
            if names == []: return              # if there are no names, exit
            names.sort()                        # we want them in alphabetic order
            print '='*41
            for name in names:
                print string.rjust(name, 20), ":", string.ljust(self.people[name], 20)
            print '='*41

        def help_EOF(self):
            print "Quits the program"
        def do_EOF(self, line):
            sys.exit()

        def help_save(self):
            print "save the current state of affairs"
        def do_save(self, filename):
            if filename == "": filename = raw_input("Enter filename: ")
            saveFile = open(filename, 'w')
            pickle.dump(self.people, saveFile)

        def help_load(self):
            print "load a directory"
        def do_load(self, filename):
            if filename == "": filename = raw_input("Enter filename: ")
            saveFile = open(filename, 'r')
            self.people = pickle.load(saveFile) # note that this will override
                                                # any existing people directory

    if __name__ == '__main__':                  # this way the module can be
        rolo = Rolodex()                        # imported by other programs as well
        rolo.cmdloop()

So, the people instance variable is a simple mapping between names and phone numbers that the add and find commands use. Commands are the methods which start with do_, and their help is given by the corresponding help_ methods. Finally, the load and save commands use the pickle module, which is explained in more detail in Chapter 10, Frameworks and Applications.
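The do_ naming convention can also be exercised without an interactive session. The following is a minimal sketch in current Python 3 syntax (an assumption; rolo.py itself is written for the Python of its day). MiniRolodex is a hypothetical, cut-down cousin of the Rolodex class whose add command takes the phone number on the command line instead of prompting for it, and the script feeds it commands through the onecmd method of Cmd rather than through cmdloop().

```python
import cmd

class MiniRolodex(cmd.Cmd):
    """A cut-down, hypothetical variant of the Rolodex class."""
    prompt = "Monty's Friends: "

    def __init__(self):
        super().__init__()                  # initialize the base class
        self.people = {}                    # at first, we know nobody

    def do_add(self, line):
        """Add an entry: add NAME PHONE"""
        name, phone = line.split()
        self.people[name] = phone

    def do_find(self, name):
        """Find an entry: find NAME"""
        if name in self.people:
            print("The number for %s is %s." % (name, self.people[name]))
        else:
            print("We have no record for %s." % name)

    def do_EOF(self, line):
        """Quit the program"""
        return True                         # a true return value ends cmdloop()

rolo = MiniRolodex()
rolo.onecmd("add larry 555-1216")           # dispatches to do_add
rolo.onecmd("find larry")
rolo.onecmd("find tom")
```

Driving the class through onecmd makes the command methods easy to test; calling rolo.cmdloop() instead would give the interactive prompt.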
    # onecmd method of Cmd class, see Lib/cmd.py
    def onecmd(self, line):            # line is something like "find tom"
        line = string.strip(line)      # get rid of extra whitespace
        if not line:                   # if there is nothing left,
            line = self.lastcmd        # redo the last command
        else:
            self.lastcmd = line        # save for next time
        i, n = 0, len(line)
                                       # next line finds end of first word
        while i < n and line[i] in self.identchars: i = i+1
                                       # split line into command + arguments
        cmd, arg = line[:i], string.strip(line[i:])
        if cmd == '':                  # happens if line doesn't start with A-z
            return self.default(line)
        else:                          # cmd is 'find', arg is 'tom'
            try:
                func = getattr(self, 'do_' + cmd)   # look for method
            except AttributeError:
                return self.default(line)
            return func(arg)           # call method with the rest of the line

This example demonstrates the power of Python that comes from extending existing modules. The cmd module takes care of the prompt, help facility, and parsing of the input. The pickle module does all the loading and saving that can be so difficult in lesser languages. All we had to write were the parts specific to the task at hand. The generic aspect, namely an interactive interpreter, came free.

Exercises
This chapter is full of programs we encourage you to type in and play with. However, if you really want exercises, here are a few more challenging ones:
- Redirecting stdout. Modify the mygrep.py script to output to the last file specified on the command line instead of to the console.
- Writing a shell. Using the Cmd class in the cmd module and the functions listed in Chapter 8 for manipulating files and directories, write a little shell that accepts the standard Unix commands (or DOS commands if you'd rather): ls (dir) for listing the current directory, cd for changing directory, mv (or ren) for moving/renaming a file, and cp (copy) for copying a file.
- Understanding map, reduce, and filter. The map, reduce, and filter functions are somewhat difficult to understand if it's the first time you've encountered this type of function, partly because they involve passing functions as arguments, and partly because they do a lot even with such small names. One good way to ensure you know how they work is to rewrite them; in this exercise, write three functions (map2, reduce2, filter2) that do the same thing as map, reduce, and filter, respectively, at least as far as we've described how they work:
    - map2 takes two arguments. The first should be a function accepting one argument, or None. The second should be a sequence. If the first argument is a function, that function is called with each element of the sequence, and the resulting values are returned in a list. If the first argument is None, the sequence is converted to a list, and that list is returned.
    - reduce2 takes two arguments. The first must be a function accepting two arguments, and the second must be a sequence. The first two elements of the sequence are used as arguments to the function, and the result of that call is sent as the first argument to the function again, with the third element of the sequence as the second argument, and so on, until all elements of the sequence have been used as arguments to the function. The last returned value from the function is then the return value for the reduce2 call.
    - filter2 takes two arguments. The first can be None or a function accepting one argument. The second must be a sequence. If the first argument is None, filter2 returns the subset of the elements in the sequence that test true. If the first argument is a function, that function is called with every element in the sequence in turn, and only those elements for which the return value of the function applied to them is true are returned by filter2.

1. Some objects don't qualify as "reasonably copyable," such as modules, file objects, and sockets. Remember that file objects are different from files on disk.

2. The random module provides many other useful functions, such as the random function, which returns a random floating-point number between 0 and 1. Check a reference source for details.

3. It turns out that map can do more; for example, if None is the first argument, map converts the sequence that is its second argument to a list. It can also operate on more than one sequence at a time. Check a reference source for details.

4. Runtime binding means that Python doesn't know which sort of object implements an interface until the program is running. This behavior stems from the lack of type declarations in Python and leads to the notion of polymorphism; in Python, the meaning of an object operation (such as indexing, slicing, etc.) depends on the object being operated on.

5. Two important compatibility comments: the win32pipe module also has a popen2 call, which is like the popen2 call on Unix, except that it returns the read and write pipes in swapped order (see the documentation for popen2 in the posix module for details on its interface). There is no equivalent of popen on Macs, since pipes don't exist on that operating system.

6. XML (eXtensible Markup Language) is a language for marking up structured text files that emphasizes the structure of the document, not its graphical nature. XML processing is an entirely different area of Python text processing, with much ongoing work. See Appendix A, Python Resources, for some pointers to discussion groups and software.
© 2001, O'Reilly & Associates, Inc.