Credit: Mark Lutz, author of Programming Python, co-author of Learning Python
Behold the file—one of the first things that any reasonably pragmatic programmer reaches for in a programming language’s toolbox. Because processing external files is a very real, tangible task, the quality of file-processing interfaces is a good way to assess the practicality of a programming tool.
As the examples in this chapter attest, Python shines here too. In
fact, files in Python are supported in a variety of layers: from the
built-in open
function’s standard
file object, to specialized tools in standard library modules such as
os
, to third-party utilities available on the Web.
All told, Python’s arsenal of file tools provides
several powerful ways to access files in your scripts.
In
Python, a file object is an instance of a built-in
type. The built-in function
open
creates and returns a file object. The first argument, a string,
specifies the file’s path (i.e., the filename
preceded by an optional directory path). The second argument to
open
, also a string, specifies the mode in which
to open the file. For example:
input = open('data', 'r') output = open('/tmp/spam', 'w')
open
accepts a file path in which directories and
files are separated by slash characters
(/
), regardless of the
proclivities of the underlying operating system. On systems that
don’t use slashes, you can use a backslash character
(\
) instead, but
there’s no real reason to do so. Backslashes are
harder to fit nicely in string literals, since you have to double
them up or use “raw” strings. If
the file path argument does not include the file’s
directory name, the file is assumed to reside in the current working
directory (which is a disjoint concept from the Python module search
path).
For the mode argument, use
'r'
to read the file in text mode; this is the
default value and is commonly omitted, so that
open
is called with just one argument. Other
common modes are 'rb'
to read the file in binary
mode, 'w'
to create and write to the file in text
mode, and 'wb'
to create and write to the file in
binary mode.
The distinction between text mode and binary mode is important on
non-Unix-like platforms, because of the line-termination characters
used on these systems. When you open a file in binary mode, Python
knows that it doesn’t need to worry about
line-termination characters; it just
moves bytes between the file and in-memory strings without any kind
of translation. When you open a file in text mode on a non-Unix-like
system, however, Python knows it must translate between the
'\n'
line-termination characters used in strings
and whatever the current platform uses in the file itself. All of
your Python code can always rely on '\n'
as the
line-termination character, as long as you properly indicate text or
binary mode when you open the file.
Once you have a file
object, you perform all file I/O by calling methods of this object,
as we’ll discuss in a moment. When
you’re done with the file, you should finish by
calling the
close
method on the object, to close the connection to the file:
input.close( )
In short scripts, people often omit this step, as Python
automatically closes the file when a file object is reclaimed during
garbage collection. However, it is good programming practice to close
your files, and it is especially a good idea in larger programs. Note
that
try
/finally
is particularly well suited to ensuring that a file gets closed, even
when the program terminates by raising an uncaught exception.
To
write to a file, use the write
method:
output.write(s)
where s
is a string. Think of s
as a string of characters if output
is open for
text-mode writing and as a string of bytes if
output
is open for binary-mode writing. There are
other writing-related methods, such as flush
,
which sends any data that is being buffered, and
writelines
, which writes a list of strings in a
single call. However, none of these other methods appear in the
recipes in this chapter, as write
is by far the
most commonly used method.
Reading
from a file is more common than writing to a file, and there are more
issues involved, so file objects have more reading methods than
writing ones. The readline
method reads and
returns the next line from a text file:
while 1: line = input.readline( ) if not line: break process(line)
This was idiomatic Python, but it is no longer the best way to read
lines from a file. Another alternative is to use the
readlines
method, which reads the whole file and returns a list of lines:
for line in input.readlines( ): process(line)
However, this is useful only for files that fit comfortably in
physical memory. If the file is truly huge,
readlines
can fail or at least slow things down
quite drastically as virtual memory fills up and the operating system
has to start copying parts of physical memory to disk. Python 2.1
introduced the
xreadlines
method, which works just like readlines
in a
for
loop but consumes a bounded amount of memory
regardless of the size of the file:
for line in input.xreadlines( ): process(line)
Python 2.2 introduced the ideal solution, whereby you can loop on the
file object itself, implicitly getting a line at a time with the same
memory and performance characteristics of
xreadlines
:
for line in input: process(line)
Of course, you don’t always want to read a file line
by line. You may instead want to read some or all of the bytes in the
file, particularly if you’ve opened the file for
binary-mode reading, where lines are unlikely to be an applicable
concept. In this case, you can use the read
method. When called without arguments, read
reads
and returns all the remaining bytes from the file. When
read
is called with an integer argument
N
, it reads and returns the next
N
bytes (or all the remaining bytes, if less than
N
bytes remain). Other methods worth mentioning
are seek
and tell
, which
support random access to files. These are normally used with binary
files made up of fixed-length records.
On the surface, Python’s file support is straightforward. However, there are two aspects of Python’s file support that I want to underscore up-front, before you peruse the code in this chapter: script portability and interface flexibility.
Keep in mind that most file interfaces in Python are fully portable across platform boundaries. It would be difficult to overstate the importance of this feature. A Python script that searches all files in a directory tree for a bit of text, for example, can be freely moved from platform to platform without source-code changes: just copy the script’s source file to the new target machine. I do it all the time—so much so that I can happily stay out of operating-system wars. With Python’s portability, the underlying platform is largely irrelevant.
Also, it has always struck me that Python’s file-processing interfaces are not restricted to real, physical files. In fact, most file tools work with any kind of object that exposes the same interface as a real file object. Thus, a file reader cares only about read methods, and a file writer cares only about write methods. As long as the target object implements the expected protocol, all goes well.
For example, suppose you have written a general file-processing function such as the following, intending to apply a passed-in function to each line in an input file:
def scanner(fileobject, linehandler): for line in fileobject.readlines( ): linehandler(line)
If you code this function in a module file and drop that file in a directory listed on your Python search path, you can use it anytime you need to scan a text file, now or in the future. To illustrate, here is a client script that simply prints the first word of each line:
from myutils import scanner def firstword(line): print line.split( )[0] file = open('data') scanner(file, firstword)
So far, so good; we’ve just coded a reusable
software component. But notice that there are no type declarations in
the scanner
function, only an interface
constraint—any object with a readlines
method will do. For instance, suppose you later want to provide
canned test input from a string object, instead of from a real,
physical file. The standard
StringIO
module, and the equivalent but faster
cStringIO
, provide the appropriate wrapping and
interface forgery:
from cStringIO import StringIO from myutils import scanner def firstword(line): print line.split( )[0] string = StringIO('one\ntwo xxx\nthree\n') scanner(string, firstword)
Here, StringIO
objects are plug-and-play
compatible with file objects, so scanner
takes its
three lines of text from an in-memory string object, rather than a
true external file. You don’t need to change the
scanner to make this work—just send it the right kind of
object. For more generality, use a class to implement the expected
interface instead:
class MyStream: def readlines(self): # Grab and return text from a source here return ['a\n', 'b c d\n'] from myutils import scanner def firstword(line): print line.split( )[0] object = MyStream( ) scanner(object, firstword)
This time, as scanner
attempts to read the file,
it really calls out to the readlines
method
you’ve coded in your class. In practice, such a
method might use other Python standard tools to grab text from a
variety of sources: an interactive user, a pop-up GUI input box, a
shelve
object, an SQL database, an XML or HTML
page, a network socket, and so on. The point is that
scanner
doesn’t know or care what
kind of object is implementing the interface it expects, or what that
interface actually does.
Object-oriented programmers know this deliberate naiveté
as
polymorphism.
The type of the object being processed determines what an operation,
such as the readlines
method call in
scanner
, actually does. Everywhere in Python,
object interfaces, rather than specific data types, are the unit of
coupling. The practical effect is that functions are often applicable
to a much broader range of problems than you might expect. This is
especially true if you have a background in strongly typed languages
such as C or C++. It is almost as if we get C++ templates for free in
Python. Code has an innate flexibility that is a byproduct of
Python’s dynamic typing.
Of course, code portability and flexibility run rampant in Python development and are not really confined to file interfaces. Both are features of the language that are simply inherited by file-processing scripts. Other Python benefits, such as its easy scriptability and code readability, are also key assets when it comes time to change file-processing programs. But, rather than extolling all of Python’s virtues here, I’ll simply defer to the wonderful example programs in this chapter and this text at large for more details. Enjoy!
Get Python Cookbook now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.