Chapter 4. File and Directory Tools
“Erase Your Hard Drive in Five Easy Steps!”
This chapter continues our look at system interfaces in Python by focusing on file and directory-related tools. As you’ll see, it’s easy to process files and directory trees with Python’s built-in and standard library support. Because files are part of the core Python language, some of this chapter’s material is a review of file basics covered in books like Learning Python, Fourth Edition, and we’ll defer to such resources for more background details on some file-related concepts. For example, iteration, context managers, and the file object’s support for Unicode encodings are demonstrated along the way, but these topics are not repeated in full here. This chapter’s goal is to tell enough of the file story to get you started writing useful scripts.
File Tools
External files are at the heart of much of what we do with system utilities. For instance, a testing system may read its inputs from one file, store program results in another file, and check expected results by loading yet another file. Even user interface and Internet-oriented programs may load binary images and audio clips from files on the underlying computer. It’s a core programming concept.
In Python, the built-in open
function is the primary tool scripts use to access the
files on the underlying computer system. Since this function is an
inherent part of the Python language, you may already be familiar with
its basic workings. When called, the open
function returns a new file object that is connected to
the external file; the file object has methods that transfer data to
and from the file and perform a variety of file-related operations.
The open
function also provides a
portable interface to the underlying
filesystem—it works the same way on every platform on which Python
runs.
Other file-related modules built into Python allow us to do
things such as manipulate lower-level descriptor-based files (os
); copy, remove, and move files and
collections of files (os
and
shutil
); store data and objects in
files by key (dbm
and shelve
); and access SQL databases (sqlite3
and third-party add-ons). The last
two of these categories are related to database topics, addressed in
Chapter 17.
In this section, we’ll take a brief tutorial look at the
built-in file object and explore a handful of more advanced
file-related topics. As usual, you should consult either Python’s
library manual or reference books such as Python Pocket
Reference for further details and methods we don’t
have space to cover here. Remember, for quick interactive help, you
can also run dir(file) on an open file object to see an attribute
list that includes methods; help(file) for general help; and
help(file.read) for help on a specific method such as read, though
the file object implementation in 3.1 provides less information for
help than the library manual and other resources.
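For example, here is a quick interactive sketch of these help tools at work; the attribute test is arbitrary, and help output is omitted here and varies by Python version:

>>> f = open('data.txt', 'w')
>>> 'read' in dir(f), 'write' in dir(f)    # methods appear among the attributes
(True, True)
>>> help(f.write)                          # prints the method's docstring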
The File Object Model in Python 3.X
Just like the string types we noted in Chapter 2,
file support in Python 3.X is a bit richer than it was in the past.
As we noted earlier, in Python 3.X str
strings always represent Unicode text
(ASCII or wider), and bytes
and
bytearray
strings represent raw
binary data. Python 3.X draws a similar and related distinction
between files containing text and binary data:
- Text files contain Unicode text. In your script, text file content is always a str string—a sequence of characters (technically, Unicode “code points”). Text files perform the automatic line-end translations described in this chapter by default and automatically apply Unicode encodings to file content: they encode to and decode from raw binary bytes on transfers to and from the file, according to a provided or default encoding name. Encoding is trivial for ASCII text, but may be sophisticated in other cases.
- Binary files contain raw 8-bit bytes. In your script, binary file content is always a byte string, usually a bytes object—a sequence of small integers, which supports most str operations and displays as ASCII characters whenever possible. Binary files perform no translations of data when it is transferred to and from files: no line-end translations or Unicode encodings are performed.
In practice, text files are used for all truly text-related
data, and binary files store items like packed binary data, images,
audio files, executables, and so on. As a programmer you distinguish
between the two file types in the mode string argument you pass to
open: adding a “b” (e.g., 'rb', 'wb') means the file contains binary
data. For coding new file content, use normal strings for text
(e.g., 'spam' or bytes.decode()) and byte strings for binary (e.g.,
b'spam' or str.encode()).
Unless your file scope is limited to ASCII text, the 3.X
text/binary distinction can sometimes impact your code. Text files
create and require str
strings,
and binary files use byte strings; because you cannot freely mix the
two string types in expressions, you must choose file mode
carefully. Many built-in tools we’ll use in this book make the
choice for us; the struct
and
pickle
modules, for instance,
deal in byte strings in 3.X, and the xml
package in Unicode str
. You must even be aware of the 3.X
text/binary distinction when using system tools like pipe
descriptors and sockets, because they transfer
data as byte strings today (though their content can be decoded and
encoded as Unicode text if needed).
Moreover, because text-mode files require that content be
decodable per a Unicode encoding scheme, you must read undecodable
file content in binary mode, as byte strings (or catch Unicode
exceptions in try
statements and
skip the file altogether). This may include both truly binary files
as well as text files that use encodings that are nondefault and
unknown. As we’ll see later in this chapter, because str
strings are always Unicode in 3.X,
it’s sometimes also necessary to select byte string mode for the
names of files in directory tools such as os.listdir
, glob.glob
, and os.walk
if they cannot be decoded (passing
in byte strings essentially suppresses decoding).
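For instance, here is a minimal sketch of this convention at work; the directory contents shown are illustrative only:

>>> import os, glob
>>> os.listdir('.')           # str argument: names decoded to str
['data.txt', 'data.bin']
>>> os.listdir(b'.')          # bytes argument: names left as raw bytes
[b'data.txt', b'data.bin']
>>> glob.glob(b'*.txt')       # glob and os.walk follow the same convention
[b'data.txt']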
In fact, we’ll see examples where the Python 3.X distinction
between str
text and bytes
binary pops up in tools beyond basic
files throughout this book—in Chapters 5
and 12 when we explore sockets; in
Chapters 6 and 11
when we’ll need to ignore Unicode errors in file and directory
searches; in Chapter 12, where we’ll see
how client-side Internet protocol modules such as FTP and email,
which run atop sockets, imply file modes and encoding requirements;
and more.
But just as for string types, although we will see some of these concepts in action in this chapter, we’re going to take much of this story as a given here. File and string objects are core language material and are prerequisite to this text. As mentioned earlier, because they are addressed by a 45-page chapter in the book Learning Python, Fourth Edition, I won’t repeat their coverage in full in this book. If you find yourself confused by the Unicode and binary file and string concepts in the following sections, I encourage you to refer to that text or other resources for more background information in this domain.
Using Built-in File Objects
Despite the text/binary dichotomy in Python 3.X, files are still very
straightforward to use. For most purposes, in fact, the open
built-in
function and its file objects are all you need to remember to
process files in your scripts. The file object returned by open
has methods for reading data
(read
, readline
, readlines
); writing data (write
, writelines
); freeing system resources
(close
); moving to arbitrary
positions in the file (seek
);
forcing data in output buffers to be transferred to disk (flush
); fetching the underlying file
handle (fileno
); and more. Since
the built-in file object is so easy to use, let’s jump right into a
few interactive examples.
Output files
To make a new file, call open
with two arguments: the external name of the
file to be created and a mode string w
(short for
write). To store data on the file, call the
file object’s write
method with a string containing
the data to store, and then call the close
method to
close the file. File write
calls return the number of characters or bytes written (which
we’ll sometimes omit in this book to save space), and as we’ll
see, close
calls are often
optional, unless you need to open and read the file again during
the same program or session:
C:\temp> python
>>> file = open('data.txt', 'w')            # open output file object: creates
>>> file.write('Hello file world!\n')       # writes strings verbatim
18
>>> file.write('Bye file world.\n')         # returns number chars/bytes written
16
>>> file.close()                            # closed on gc and exit too
And that’s it—you’ve just generated a brand-new text file on your computer, regardless of the computer on which you type this code:
C:\temp> dir data.txt /B
data.txt

C:\temp> type data.txt
Hello file world!
Bye file world.
There is nothing unusual about the new file; here, I use the
DOS dir
and type
commands to list and display the
new file, but it shows up in a file explorer GUI, too.
Opening
In the open
function
call shown in the preceding example, the first
argument can optionally specify a complete directory path as
part of the filename string. If we pass just a simple filename
without a path, the file will appear in Python’s current working
directory. That is, it shows up in the place where the code is
run. Here, the directory C:\temp on my
machine is implied by the bare filename
data.txt, so this actually creates a file
at C:\temp\data.txt. More accurately, the
filename is relative to the current working directory if it does
not include a complete absolute directory path. See Current Working
Directory (Chapter 3) for a refresher on this topic.
Also note that when opening in w
mode, Python either creates the
external file if it does not yet exist or erases the file’s
current contents if it is already present on your machine (so be
careful out there—you’ll delete whatever was in the file
before).
Writing
Notice that we added an explicit \n
end-of-line character to lines
written to the file; unlike the print
built-in function, file object
write
methods write exactly
what they are passed without adding any extra formatting. The
string passed to write
shows
up character for character on the external file. In text files,
data written may undergo line-end or Unicode translations which
we’ll describe ahead, but these are undone when the data is
later read back.
Output files also sport a writelines
method, which simply writes all of the strings in a list one at
a time without adding any extra formatting. For example, here is
a writelines
equivalent to
the two write
calls shown
earlier:
file.writelines(['Hello file world!\n', 'Bye file world.\n'])
This call isn’t as commonly used (and can be emulated with a simple
for loop or other iteration tool, as the sketch below shows), but it
is convenient in scripts that save output in a list to be written
later.
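For reference, here is a sketch of the loop emulation just mentioned; it has the same effect as the writelines call:

lines = ['Hello file world!\n', 'Bye file world.\n']
file = open('data.txt', 'w')
for line in lines:            # same as file.writelines(lines)
    file.write(line)          # write adds no formatting of its own
file.close()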
Closing
The file close
method
used earlier finalizes file contents and frees up
system resources. For instance, closing forces buffered output
data to be flushed out to disk. Normally, files are
automatically closed when the file object is garbage collected
by the interpreter (that is, when it is no longer referenced).
This includes all remaining open files when the Python session
or program exits. Because of that, close
calls are often optional. In
fact, it’s common to see file-processing code in Python in this
idiom:
open('somefile.txt', 'w').write("G'day Bruce\n")     # write to temporary object
open('somefile.txt', 'r').read()                     # read from temporary object
Since both these expressions make a temporary file object,
use it immediately, and do not save a reference to it, the file
object is reclaimed right after data is transferred, and is
automatically closed in the process. There is usually no need
for such code to call the close
method explicitly.
In some contexts, though, you may wish to explicitly close anyhow:
For one, because the Jython implementation relies on Java’s garbage collector, you can’t always be as sure about when files will be reclaimed as you can in standard Python. If you run your Python code with Jython, you may need to close manually if many files are created in a short amount of time (e.g., in a loop), in order to avoid running out of file resources on operating systems where this matters.
For another, some IDEs, such as Python’s standard IDLE GUI, may hold on to your file objects longer than you expect (in stack tracebacks of prior errors, for instance), and thus prevent them from being garbage collected as soon as you might expect. If you write to an output file in IDLE, be sure to explicitly close (or flush) your file if you need to reliably read it back during the same IDLE session. Otherwise, output buffers might not be flushed to disk and your file may be incomplete when read.
And while it seems very unlikely today, it’s not impossible that this auto-close-on-reclaim feature could change in the future. Technically, this is a feature of the file object’s implementation, which may or may not be considered part of the language definition over time.
For these reasons, manual close calls are not a bad idea in nontrivial programs, even if they are technically not required. Closing is a generally harmless but robust habit to form.
Ensuring file closure: Exception handlers and context managers
Manual file close method calls are easy in straight-line code, but how do you ensure file closure when exceptions might kick your program beyond the point where the close call is coded? First of all, make sure you must—files close themselves when they are collected, and this will happen eventually, even when exceptions occur.
If closure is required, though, there are two basic
alternatives: the try
statement’s
finally
clause is the most
general, since it allows you to provide general exit actions for
any type of exceptions:
myfile = open(filename, 'w')
try:
    ...process myfile...
finally:
    myfile.close()
In recent Python releases, though, the with
statement
provides a more concise alternative for some specific objects and
exit actions, including closing files:
with open(filename, 'w') as myfile:
    ...process myfile, auto-closed on statement exit...
This statement relies on the file object’s context manager: code automatically run both on statement entry and on statement exit regardless of exception behavior. Because the file object’s exit code closes the file automatically, this guarantees file closure whether an exception occurs during the statement or not.
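To make the protocol more concrete, here is a minimal sketch of a user-defined context manager that mimics what file objects do on entry and exit; the TraceOpen name is hypothetical, not a standard library tool:

class TraceOpen:
    def __init__(self, filename, mode='r'):
        self.file = open(filename, mode)
    def __enter__(self):                          # run on statement entry
        return self.file                          # becomes the 'as' variable
    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()                         # run on exit, even on errors
        return False                              # don't suppress exceptions

with TraceOpen('data.txt', 'w') as f:             # works like built-in files
    f.write('Hello context manager world!\n')

File objects come with equivalent entry and exit code built in, which is why they can be used in with statements directly.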
The with
statement is
notably shorter (3 lines) than the try
/finally
alternative, but it’s also less
general—with
applies only to
objects that support the context manager protocol, whereas
try
/finally
allows arbitrary exit actions
for arbitrary exception contexts. While some other object types
have context managers, too (e.g., thread locks), with
is limited in scope. In fact, if
you want to remember just one exit actions option, try
/finally
is the most inclusive. Still,
with
yields less code for files
that must be closed and can serve well in such specific roles. It
can even save a line of code when no exceptions are expected (albeit at
the expense of further nesting and indenting file processing logic):
myfile = open(filename, 'w')          # traditional form
...process myfile...
myfile.close()

with open(filename) as myfile:        # context manager form
    ...process myfile...
In Python 3.1 and later, this statement can also specify
multiple (a.k.a. nested) context managers—any number of context
manager items may be separated by commas, and multiple items work
the same as nested with
statements. In general terms, the 3.1 and later code:
with A() as a, B() as b:
    ...statements...
Runs the same as the following, which works in 3.1, 3.0, and 2.6:
with A() as a:
    with B() as b:
        ...statements...
For example, when the with
statement block exits in the
following, both files’ exit actions are automatically run to close
the files, regardless of exception outcomes:
with open('data') as fin, open('results', 'w') as fout:
    for line in fin:
        fout.write(transform(line))
Context manager–dependent code like this seems to have
become more common in recent years, but this is likely at least in
part because newcomers are accustomed to languages that require
manual close calls in all cases. In most contexts there is no need
to wrap all your Python file-processing code in with
statements—the file object’s
auto-close-on-collection behavior often suffices, and manual close
calls are enough for many other scripts. You should use the
with
or try
options outlined here only if you
must close, and only in the presence of potential exceptions.
Since standard C Python automatically closes files on collection,
though, neither option is required in many (and perhaps most) scripts.
Input files
Reading data from external files is just as easy as writing, but
there are more methods that let us load data in a variety of
modes. Input text files are opened with either a mode flag of
r
(for “read”) or no mode flag
at all—it defaults to r
if
omitted, and it commonly is. Once opened, we can read the lines of
a text file with the readlines
method:
C:\temp> python
>>> file = open('data.txt')          # open input file object: 'r' default
>>> lines = file.readlines()         # read into line string list
>>> for line in lines:               # BUT use file line iterator! (ahead)
...     print(line, end='')          # lines have a '\n' at end
...
Hello file world!
Bye file world.
The readlines
method
loads the entire contents of the file into memory and gives it to
our scripts as a list of line strings that we can step through in
a loop. In fact, there are many ways to read an input file: read()
loads the entire file into a single string; read(N) reads at most N
characters (or bytes); readline() reads the next line, through its
end-of-line character; and readlines() reads the entire file into a
list of line strings.
Let’s run these method calls to read files, lines, and
characters from a text file—the seek(0)
call is used here before each
test to rewind the file to its beginning (more on this call in a
moment):
>>> file.seek(0)                   # go back to the front of file
>>> file.read()                    # read entire file into string
'Hello file world!\nBye file world.\n'

>>> file.seek(0)                   # read entire file into lines list
>>> file.readlines()
['Hello file world!\n', 'Bye file world.\n']

>>> file.seek(0)
>>> file.readline()                # read one line at a time
'Hello file world!\n'
>>> file.readline()
'Bye file world.\n'
>>> file.readline()                # empty string at end-of-file
''

>>> file.seek(0)                   # read N (or remaining) chars/bytes
>>> file.read(1), file.read(8)     # empty string at end-of-file
('H', 'ello fil')
All of these input methods let us be specific about how much to fetch. Here are a few rules of thumb about which to choose:

- read() and readlines() load the entire file into memory all at once. That makes them handy for grabbing a file’s contents with as little code as possible. It also makes them generally fast, but costly in terms of memory for huge files—loading a multigigabyte file into memory is not generally a good thing to do (and might not be possible at all on a given computer).
- On the other hand, because the readline() and read(N) calls fetch just part of the file (the next line or N-character-or-byte block), they are safer for potentially big files but a bit less convenient and sometimes slower; see the chunked-read sketch after this list. Both return an empty string when they reach end-of-file. If speed matters and your files aren’t huge, read or readlines may be a generally better choice.
- See also the discussion of the newer file iterators in the next section. As we’ll see, iterators combine the convenience of readlines() with the space efficiency of readline() and are the preferred way to read text files by lines today.
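Here is the chunked-read sketch referenced in the list above: a loop that copies a file of any size in fixed-size blocks; the function name and block size are arbitrary choices:

def copyfile(fromname, toname, chunksize=1024 * 1024):
    with open(fromname, 'rb') as fin, open(toname, 'wb') as fout:
        while True:
            chunk = fin.read(chunksize)    # at most chunksize bytes per read
            if not chunk:                  # empty byte string at end-of-file
                break
            fout.write(chunk)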
The seek(0)
call used
repeatedly here means “go back to the start of the file.” In our
example, it is an alternative to reopening the file each time. In
files, all read and write operations take place at the current
position; files normally start at offset 0 when opened and advance
as data is transferred. The seek
call simply lets us move to a new
position for the next transfer operation. More on this method
later when we explore random access files.
Reading lines with file iterators
In older versions of Python, the traditional way to read a file line
by line in a for
loop was to
read the file into a list that could be stepped through as
usual:
>>> file = open('data.txt')
>>> for line in file.readlines():    # DON'T DO THIS ANYMORE!
...     print(line, end='')
If you’ve already studied the core language using a first
book like Learning
Python, you may already know that this coding
pattern is actually more work than is needed today—both for you and your
computer’s memory. In recent Pythons, the file object includes an
iterator which is smart enough to grab just
one line per request in all iteration contexts, including for
loops and list comprehensions. The
practical benefit of this extension is that you no longer need to
call readlines
in a
for
loop to scan line by
line—the iterator reads lines on request automatically:
>>> file = open('data.txt')
>>> for line in file:                # no need to call readlines
...     print(line, end='')          # iterator reads next line each time
...
Hello file world!
Bye file world.
Better still, you can open the file in the loop statement itself, as a temporary which will be automatically closed on garbage collection when the loop ends (that’s normally the file’s sole reference):
>>> for line in open('data.txt'):    # even shorter: temporary file object
...     print(line, end='')          # auto-closed when garbage collected
...
Hello file world!
Bye file world.
Moreover, this file line-iterator form does not load the
entire file into a line’s list all at once, so it will be more
space efficient for large text files. Because of that, this is the
prescribed way to read line by line today. If you want to see what
really happens inside the for
loop, you can use the iterator manually; it’s just a __next__
method (run by the next
built-in function), which is
similar to calling the readline
method
each time through, except that read methods return an empty string
at end-of-file (EOF
) and the
iterator raises an exception to end the iteration:
>>> file = open('data.txt')          # read methods: empty at EOF
>>> file.readline()
'Hello file world!\n'
>>> file.readline()
'Bye file world.\n'
>>> file.readline()
''

>>> file = open('data.txt')          # iterators: exception at EOF
>>> file.__next__()                  # no need to call iter(file) first,
'Hello file world!\n'                # since files are their own iterator
>>> file.__next__()
'Bye file world.\n'
>>> file.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
Interestingly, iterators are automatically used in all
iteration contexts, including the list
constructor call, list
comprehension expressions, map
calls, and in
membership
checks:
>>> open('data.txt').readlines()                # always read lines
['Hello file world!\n', 'Bye file world.\n']

>>> list(open('data.txt'))                      # force line iteration
['Hello file world!\n', 'Bye file world.\n']

>>> lines = [line.rstrip() for line in open('data.txt')]    # comprehension
>>> lines
['Hello file world!', 'Bye file world.']

>>> lines = [line.upper() for line in open('data.txt')]     # arbitrary actions
>>> lines
['HELLO FILE WORLD!\n', 'BYE FILE WORLD.\n']

>>> list(map(str.split, open('data.txt')))      # apply a function
[['Hello', 'file', 'world!'], ['Bye', 'file', 'world.']]

>>> line = 'Hello file world!\n'
>>> line in open('data.txt')                    # line membership
True
Iterators may seem somewhat implicit at first glance, but they’re representative of the many ways that Python makes developers’ lives easier over time.
Other open options
Besides the w
and (default) r
file open modes, most platforms support an a
mode string, meaning “append.” In this
output mode, write
methods add
data to the end of the file, and the open
call will not erase the current
contents of the file:
>>> file = open('data.txt', 'a')     # open in append mode: doesn't erase
>>> file.write('The Life of Brian')  # added at end of existing data
>>> file.close()
>>>
>>> open('data.txt').read()          # open and read entire file
'Hello file world!\nBye file world.\nThe Life of Brian'
In fact, although most files are opened using the sorts of
calls we just ran, open
actually supports additional arguments for more specific
processing needs, the first three of which are the most commonly
used—the filename, the open mode, and a buffering specification.
All but the first of these are optional: if omitted, the open mode
argument defaults to r
(input),
and the buffering policy is to enable full buffering. For special
needs, here are a few things you should know about these three
open
arguments:
- Filename
As mentioned earlier, filenames can include an explicit directory path to refer to files in arbitrary places on your computer; if they do not, they are taken to be names relative to the current working directory (described in the prior chapter). In general, most filename forms you can type in your system shell will work in an open call. For instance, a relative filename argument r'..\temp\spam.txt' on Windows means spam.txt in the temp subdirectory of the current working directory’s parent—up one, and down to directory temp.
- Open mode
The open function accepts other modes, too, some of which we’ll see at work later in this chapter: r+, w+, and a+ to open for reads and writes, and any mode string with a b to designate binary mode. For instance, mode r+ means both reads and writes are allowed on an existing file; w+ allows reads and writes but creates the file anew, erasing any prior content; rb and wb read and write data in binary mode without any translations; and wb+ and r+b both combine binary mode and input plus output. In general, the mode string defaults to r for read but can be w for write and a for append, and you may add a + for update, as well as a b or t for binary or text mode; order is largely irrelevant.
As we’ll see later in this chapter, the + modes are often used in conjunction with the file object’s seek method to achieve random read/write access. Regardless of mode, file contents are always strings in Python programs—read methods return a string, and we pass a string to write methods. As also described later, though, the mode string implies which type of string is used: str for text mode or bytes and other byte string types for binary mode.
- Buffering policy
The open call also takes an optional third buffering policy argument which lets you control buffering for the file—the way that data is queued up before being transferred, to boost performance. If passed, 0 means file operations are unbuffered (data is transferred immediately, but this is allowed in binary modes only), 1 means they are line buffered, and any other positive value means full buffering (the default, if no buffering argument is passed).
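For illustration, here is a short sketch of the buffering argument in action; the filenames here are arbitrary:

file = open('log.txt', 'w', 1)       # 1: line buffered (text mode only)
file.write('event one\n')            # flushed to disk at each '\n'

raw = open('data.bin', 'wb', 0)      # 0: unbuffered (binary mode only)
raw.write(b'now')                    # transferred immediately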
As usual, Python’s library manual and reference texts have
the full story on additional open
arguments beyond these three. For
instance, the open
call
supports additional arguments related to the
end-of-line mapping behavior and the
automatic Unicode encoding of content
performed for text-mode files. Since we’ll discuss both of these
concepts in the next section, let’s move ahead.
Binary and Text Files
All of the preceding examples process simple text files, but
Python scripts can also open and process files containing binary data—JPEG images, audio
clips, packed binary data produced by FORTRAN and C programs,
encoded text, and anything else that can be stored in files as
bytes. The primary difference in terms of your code is the
mode argument passed to the built-in open
function:
>>> file = open('data.txt', 'wb')    # open binary output file
>>> file = open('data.txt', 'rb')    # open binary input file
Once you’ve opened binary files in this way, you may read and
write their contents using the same methods just illustrated:
read
, write
, and so on. The readline
and readlines
methods as well as the file’s
line iterator still work here for text files opened in binary mode,
but they don’t make sense for truly binary data that isn’t line
oriented (end-of-line bytes are meaningless, if they appear at
all).
In all cases, data transferred between files and your programs is represented as Python strings within scripts, even if it is binary data. For binary mode files, though, file content is represented as byte strings. Continuing with our text file from preceding examples:
>>> open('data.txt').read()          # text mode: str
'Hello file world!\nBye file world.\nThe Life of Brian'

>>> open('data.txt', 'rb').read()    # binary mode: bytes
b'Hello file world!\r\nBye file world.\r\nThe Life of Brian'

>>> file = open('data.txt', 'rb')
>>> for line in file: print(line)
...
b'Hello file world!\r\n'
b'Bye file world.\r\n'
b'The Life of Brian'
This occurs because Python 3.X treats text-mode files as Unicode, and automatically decodes
content on input and encodes it on output. Binary mode files instead
give us access to file content as raw byte strings, with no
translation of content—they reflect exactly what is stored on the
file. Because str
strings are
always Unicode text in 3.X, the special bytes
string is required to represent
binary data as a sequence of byte-size integers which may contain
any 8-bit value. Because normal and byte strings have almost
identical operation sets, many programs can largely take this on
faith; but keep in mind that you really must
open truly binary data in binary mode for input, because it will not
generally be decodable as Unicode text.
Similarly, you must also supply byte strings for binary mode output—normal strings are not raw binary data, but are decoded Unicode characters (a.k.a. code points) which are encoded to binary on text-mode output:
>>> open('data.bin', 'wb').write(b'Spam\n')
5
>>> open('data.bin', 'rb').read()
b'Spam\n'

>>> open('data.bin', 'wb').write('spam\n')
TypeError: must be bytes or buffer, not str
But notice that this file’s line ends with just \n
, instead of the Windows \r\n
that showed up in the preceding
example for the text file in binary mode. Strictly speaking, binary
mode disables Unicode encoding translation, but it also prevents the
automatic end-of-line character translation performed by text-mode
files by default. Before we can understand this fully, though, we
need to study the two main ways in which text files differ from
binary.
Unicode encodings for text files
As mentioned earlier, text-mode file objects always translate data according to a default or provided Unicode encoding type, when the data is transferred to and from the external file. Their content is encoded on files, but decoded in memory. Binary mode files don’t perform any such translation, which is what we want for truly binary data. For instance, consider the following string, which embeds a Unicode character whose binary value is outside the normal 7-bit range of the ASCII encoding standard:
>>> data = 'sp\xe4m'
>>> data
'späm'
>>> 0xe4, bin(0xe4), chr(0xe4)
(228, '0b11100100', 'ä')
It’s possible to manually encode this string according to a variety of Unicode encoding types—its raw binary byte string form is different under some encodings:
>>> data.encode('latin1')            # 8-bit characters: ascii + extras
b'sp\xe4m'

>>> data.encode('utf8')              # 2 bytes for special characters only
b'sp\xc3\xa4m'

>>> data.encode('ascii')             # does not encode per ascii
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in
position 2: ordinal not in range(128)
Python displays printable characters in these strings
normally, but nonprintable bytes show as \xNN
hexadecimal escapes which become
more prevalent under more sophisticated encoding schemes (cp500
in the following is an EBCDIC encoding):
>>> data.encode('utf16')             # 2 bytes per character plus preamble
b'\xff\xfes\x00p\x00\xe4\x00m\x00'

>>> data.encode('cp500')             # an ebcdic encoding: very different
b'\xa2\x97C\x94'
The encoded results here reflect the string’s raw binary
form when stored in files. Manual encoding is usually unnecessary,
though, because text files handle encodings automatically on data
transfers—reads decode and writes encode, according to the encoding name passed in (or a
default for the underlying platform: see
sys.getdefaultencoding). Continuing our interactive session:
>>> open('data.txt', 'w', encoding='latin1').write(data)
4
>>> open('data.txt', 'r', encoding='latin1').read()
'späm'
>>> open('data.txt', 'rb').read()
b'sp\xe4m'
If we open in binary mode, though, no encoding translation occurs—the last command in the preceding example shows us what’s actually stored on the file. To see how file content differs for other encodings, let’s save the same string again:
>>> open('data.txt', 'w', encoding='utf8').write(data)     # encode data per utf8
4
>>> open('data.txt', 'r', encoding='utf8').read()          # decode: undo encoding
'späm'
>>> open('data.txt', 'rb').read()                          # no data translations
b'sp\xc3\xa4m'
This time, raw file content is different, but text mode’s auto-decoding makes the string the same by the time it’s read back by our script. Really, encodings pertain only to strings while they are in files; once they are loaded into memory, strings are simply sequences of Unicode characters (“code points”). This translation step is what we want for text files, but not for binary. Because binary modes skip the translation, you’ll want to use them for truly binary data. In fact, you usually must—trying to write unencodable data and attempting to read undecodable data are errors:
>>> open('data.txt', 'w', encoding='ascii').write(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in
position 2: ordinal not in range(128)

>>> open(r'C:\Python31\python.exe', 'r').read()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2:
character maps to <undefined>
Binary mode is also a last resort for reading text files, if they cannot be decoded per the underlying platform’s default, and the encoding type is unknown—the following recreates the original strings if encoding type is known, but fails if it is not known unless binary mode is used (such failure may occur either on inputting the data or printing it, but it fails nevertheless):
>>> open('data.txt', 'w', encoding='cp500').writelines(['spam\n', 'ham\n'])
>>> open('data.txt', 'r', encoding='cp500').readlines()
['spam\n', 'ham\n']

>>> open('data.txt', 'r').readlines()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2:
character maps to <undefined>

>>> open('data.txt', 'rb').readlines()
[b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%']
>>> open('data.txt', 'rb').read()
b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%'
If all your text is ASCII, you can generally ignore encoding altogether; data in files maps directly to characters in strings, because ASCII is a subset of most platforms’ default encodings. If you must process files created with other encodings, and possibly on different platforms (obtained from the Web, for instance), binary mode may be required if the encoding type is unknown. Keep in mind, however, that text in still-encoded binary form might not work as you expect: because it is encoded per a given encoding scheme, it might not accurately compare or combine with text encoded in other schemes.
Again, see other resources for more on the Unicode story.
We’ll revisit the Unicode story at various points in this book,
especially in Chapter 9, to
see how it relates to the tkinter Text
widget, and in Part IV, covering Internet programming,
to learn what it means for data shipped over networks by protocols
such as FTP, email, and the Web at large. Text files have another
feature, though, which is similarly a nonfeature for binary data:
line-end translations, the topic of the next section.
End-of-line translations for text files
For historical reasons, the end of a line of text in a file is represented
by different characters on different platforms. It’s a single
\n
character on Unix-like
platforms, but the two-character sequence \r\n
on Windows. That’s why files moved
between Linux and Windows may look odd in your text editor after
transfer—they may still be stored using the original platform’s
end-of-line convention.
For example, most Windows editors handle text in Unix
format, but Notepad has been a notable exception—text files copied
from Unix or Linux may look like one long line when viewed in
Notepad, with strange characters inside (\n
). Similarly, transferring a file from
Windows to Unix in binary mode retains the \r
characters (which often appear as
^M
in text editors).
Python scripts that process text files don’t normally have
to care, because the file object automatically maps the DOS
\r\n
sequence to a single
\n
. It works like this by
default—when scripts are run
on Windows:
- For files opened in text mode, \r\n is translated to \n when input.
- For files opened in text mode, \n is translated to \r\n when output.
- For files opened in binary mode, no translation occurs on input or output.
On Unix-like platforms, no translations occur, because
\n
is used in files. You should
keep in mind two important consequences of these rules. First, the
end-of-line character for text-mode files is almost always
represented as a single \n
within Python scripts, regardless of how it is stored in external
files on the underlying platform. By mapping to and from \n
on input and output, Python hides the
platform-specific difference.
The second consequence of the mapping is subtler: when
processing binary files, binary open modes (e.g., rb
, wb
) effectively turn off line-end
translations. If they did not, the translations listed previously
could very well corrupt data as it is input or output—a random
\r
in data might be dropped on
input, or added for a \n
in the
data on output. The net effect is that your binary data would be
trashed when read and written—probably not quite what you want for
your audio files and images!
This issue has become almost secondary in Python 3.X,
because we generally cannot use binary data with text-mode files
anyhow—because text-mode files automatically apply Unicode
encodings to content, transfers will generally fail when the data
cannot be decoded on input or encoded on output. Using binary mode
avoids Unicode errors, and automatically disables line-end
translations as well (Unicode errors can be caught in try
statements too; see the sketch following this paragraph). Still, the fact
that binary mode prevents end-of-line translations to protect file
content is best noted as a separate feature, especially if you
work in an ASCII-only world where Unicode encoding issues are
irrelevant.
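For instance, here is a minimal sketch of the try-statement fallback just described; the readtext name is hypothetical:

def readtext(filename):
    try:
        return open(filename, 'r').read()     # text mode: decode per default
    except UnicodeDecodeError:
        return open(filename, 'rb').read()    # fall back on raw byte strings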
Here’s the end-of-line translation at work in Python 3.1 on Windows—text mode translates to and from the platform-specific line-end sequence so our scripts are portable:
>>> open('temp.txt', 'w').write('shrubbery\n')    # text output mode: \n -> \r\n
10
>>> open('temp.txt', 'rb').read()                 # binary input: actual file bytes
b'shrubbery\r\n'
>>> open('temp.txt', 'r').read()                  # text input mode: \r\n -> \n
'shrubbery\n'
By contrast, writing data in binary mode prevents all translations as expected, even if the data happens to contain bytes that are part of line-ends in text mode (byte strings print their characters as ASCII if printable, else as hexadecimal escapes):
>>> data = b'a\0b\rc\r\nd'                 # 4 escape code bytes, 4 normal
>>> len(data)
8
>>> open('temp.bin', 'wb').write(data)     # write binary data to file as is
8
>>> open('temp.bin', 'rb').read()          # read as binary: no translation
b'a\x00b\rc\r\nd'
But reading binary data in text mode, whether accidental or not, can corrupt the data when transferred because of line-end translations (assuming it passes as decodable at all; ASCII bytes like these do on this Windows platform):
>>> open('temp.bin', 'r').read()           # text mode read: botches \r !
'a\x00b\nc\nd'
Similarly, writing binary data in text mode can have the same effect—line-end bytes may be changed or inserted (again, assuming the data is encodable per the platform’s default):
>>> open('temp.bin', 'w').write(data)      # must pass str for text mode
TypeError: must be str, not bytes

>>> data.decode()                          # use bytes.decode() for to-str
'a\x00b\rc\r\nd'
>>> open('temp.bin', 'w').write(data.decode())
8
>>> open('temp.bin', 'rb').read()          # text mode write: added \r !
b'a\x00b\rc\r\r\nd'
>>> open('temp.bin', 'r').read()           # again drops, alters \r on input
'a\x00b\nc\n\nd'
The short story to remember here is that you should
generally use \n
to refer to
end-line in all your text file content, and you should always open
binary data in binary file modes to suppress both end-of-line
translations and any Unicode encodings. A file’s content generally
determines its open mode, and file open modes usually process file
content exactly as we want.
Keep in mind, though, that you might also need to use binary file modes for text in special contexts. For instance, in Chapter 6’s examples, we’ll sometimes open text files in binary mode to avoid possible Unicode decoding errors, for files generated on arbitrary platforms that may have been encoded in arbitrary ways. Doing so avoids encoding errors, but also can mean that some text might not work as expected—searches might not always be accurate when applied to such raw text, since the search key must be a byte string, formatted and encoded per a specific and possibly incompatible encoding scheme.
In Chapter 11’s PyEdit, we’ll
also need to catch Unicode exceptions in a “grep” directory file
search utility, and we’ll go further to allow Unicode encodings to
be specified for file content across entire trees. Moreover, a
script that attempts to translate between different platforms’
end-of-line character conventions explicitly may need to read text
in binary mode to retain the original line-end representation
truly present in the file; in text mode, they would already be
translated to \n
by the time
they reached the script.
It’s also possible to disable or further tailor end-of-line
translations in text mode with additional open arguments we will
finesse here. See the newline argument in the open reference
documentation for details; in short, passing an empty string to this
argument also prevents line-end translation but retains other
text-mode behavior (see the sketch below). For this chapter, let’s
turn next to two common use cases for binary data files: packed
binary data and random access.
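For instance, here is a sketch of the newline argument at work on Windows; passing an empty string suppresses the mappings shown earlier:

>>> open('temp.txt', 'w', newline='').write('shrubbery\r\n')   # no \n -> \r\n
11
>>> open('temp.txt', 'rb').read()                              # stored as passed
b'shrubbery\r\n'
>>> open('temp.txt', 'r', newline='').read()                   # no \r\n -> \n
'shrubbery\r\n'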
Parsing packed binary data with the struct module
By using the letter b in the open
call, you can open binary datafiles
in a platform-neutral way and read and write their content with
normal file object methods. But how do you process binary data
once it has been read? It will be returned to your script as a
simple string of bytes, most of which are probably not printable
characters.
If you just need to pass binary data along to another file
or program, your work is done—for instance, simply pass the
byte string to another file opened in binary mode. And if you just
need to extract a number of bytes from a specific position, string
slicing will do the job; you can even follow up with bitwise
operations if you need to. To get at the contents of binary data
in a structured way, though, as well as to construct its contents,
the standard library struct
module is a more powerful alternative.
The struct
module
provides calls to pack and unpack binary data, as though the data
was laid out in a C-language struct
declaration. It is also capable
of composing and decomposing using any endian-ness you
desire (endian-ness determines whether the most
significant bits of binary numbers are on the left or right side).
Building a binary datafile, for instance, is straightforward—pack
Python values into a byte string and write them to a file. The
format string here in the pack
call means big-endian (>
),
with an integer, four-character string, half integer, and
floating-point number:
>>> import struct
>>> data = struct.pack('>i4shf', 2, b'spam', 3, 1.234)
>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'

>>> file = open('data.bin', 'wb')
>>> file.write(data)
14
>>> file.close()
Notice how the struct
module returns a bytes string: we’re in the realm of binary data
here, not text, and must use binary mode files to store. As usual,
Python displays most of the packed binary data’s bytes here with
\xNN
hexadecimal escape
sequences, because the bytes are not printable characters. To
parse data like that which we just produced, read it off the file
and pass it to the struct
module with the same format string—you get back a tuple containing
the values parsed out of the string and converted to Python
objects:
>>> import struct
>>> file = open('data.bin', 'rb')
>>> data = file.read()
>>> values = struct.unpack('>i4shf', data)
>>> values
(2, b'spam', 3, 1.2339999675750732)
Parsed-out strings are byte strings again, and we can apply string and bitwise operations to probe deeper:
>>> bin(values[0] | 0b1)                   # accessing bits and bytes
'0b11'
>>> values[1], list(values[1]), values[1][0]
(b'spam', [115, 112, 97, 109], 115)
Also note that slicing comes in handy in this domain; to
grab just the four-character string in the middle of the packed
binary data we just read, we can simply slice it out. Numeric
values could similarly be sliced out and then passed to struct.unpack
for conversion:
>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> data[4:8]
b'spam'
>>> number = data[8:10]
>>> number
b'\x00\x03'
>>> struct.unpack('>h', number)
(3,)
Packed binary data crops up in many contexts, including some
networking tasks, and in data produced by other programming
languages. Because it’s not part of every programming job’s
description, though, we’ll defer to the struct
module’s entry in the Python
library manual for more details.
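One related tool is worth a quick sketch before moving on: struct.calcsize reports the size in bytes implied by a format string, which helps compute record sizes and offsets when parsing packed data (the format string here matches the earlier examples):

>>> import struct
>>> recsize = struct.calcsize('>i4shf')    # bytes per packed record
>>> recsize
14
>>> data = open('data.bin', 'rb').read()
>>> struct.unpack('>i4shf', data[:recsize])
(2, b'spam', 3, 1.2339999675750732)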
Random access files
Binary files also typically see action in random access
processing. Earlier, we mentioned that adding a +
to the open
mode string allows a file to be
both read and written. This mode is typically used in conjunction
with the file object’s seek
method to support random read/write access. Such
flexible file processing modes allow us to read bytes from one
location, write to another, and so on. When scripts combine this
with binary file modes, they may fetch and update arbitrary bytes
within a file.
We used seek
earlier to
rewind files instead of closing and reopening. As mentioned, read
and write operations always take place at the current position in
the file; files normally start at offset 0 when opened and advance
as data is transferred. The seek
call lets us move to a new position
for the next transfer operation by passing in a byte
offset.
Python’s seek
method also
accepts an optional second argument that has one of three values—0
for absolute file positioning (the default); 1 to seek relative to
the current position; and 2 to seek relative to the file’s end.
That’s why passing just an offset of 0 to seek
is roughly a file
rewind operation: it repositions the file to
its absolute start. In general, seek
supports random access on a
byte-offset basis. Seeking to a multiple of a record’s size in a
binary file, for instance, allows us to fetch a record by its
relative position.
Although you can use seek
without +
modes in open
(e.g., to just read from random
locations), it’s most flexible when combined with input/output
files. And while you can perform random access in text
mode, too, the fact that text modes perform Unicode
encodings and line-end translations makes them difficult to use
when absolute byte offsets and lengths are required for seeks and
reads—your data may look very different when stored in files. Text
mode may also make your data nonportable to platforms with
different default encodings, unless you’re willing to always
specify an explicit encoding for opens. Except for simple
unencoded ASCII text without line-ends, seek
tends to work best with binary
mode files.
To demonstrate, let’s create a file in w+b
mode (equivalent to wb+
) and write some data to it; this
mode allows us to both read and write, but initializes the file to
be empty if it’s already present (all w
modes do). After writing some data, we
seek back to file start to read its content (some integer return
values are omitted in this example again for brevity):
>>> records = [bytes([char] * 8) for char in b'spam']
>>> records
[b'ssssssss', b'pppppppp', b'aaaaaaaa', b'mmmmmmmm']

>>> file = open('random.bin', 'w+b')
>>> for rec in records:                    # write four records
...     size = file.write(rec)             # bytes for binary mode
...
>>> file.flush()
>>> pos = file.seek(0)                     # read entire file
>>> print(file.read())
b'ssssssssppppppppaaaaaaaammmmmmmm'
Now, let’s reopen our file in r+b
mode; this mode allows both reads
and writes again, but does not initialize the file to be empty.
This time, we seek and read in multiples of the size of data items
(“records”) stored, to both fetch and update them at
random:
c:\temp> python
>>> file = open('random.bin', 'r+b')
>>> print(file.read())                     # read entire file
b'ssssssssppppppppaaaaaaaammmmmmmm'

>>> record = b'X' * 8
>>> file.seek(0)                           # update first record
>>> file.write(record)

>>> file.seek(len(record) * 2)             # update third record
>>> file.write(b'Y' * 8)

>>> file.seek(8)
>>> file.read(len(record))                 # fetch second record
b'pppppppp'
>>> file.read(len(record))                 # fetch next (third) record
b'YYYYYYYY'

>>> file.seek(0)                           # read entire file
>>> file.read()
b'XXXXXXXXppppppppYYYYYYYYmmmmmmmm'

c:\temp> type random.bin                   # the view outside Python
XXXXXXXXppppppppYYYYYYYYmmmmmmmm
Finally, keep in mind that seek
can be used to achieve random
access, even if it’s just for input. The following seeks in
multiples of record size to read (but not write) fixed-length
records at random. Notice that it also uses r
text mode: since this data is simple
ASCII text bytes and has no line-ends, text and binary modes work
the same on this platform:
c:\temp> python
>>> file = open('random.bin', 'r')         # text mode ok if no encoding/endlines
>>> reclen = 8
>>> file.seek(reclen * 3)                  # fetch record 4
>>> file.read(reclen)
'mmmmmmmm'
>>> file.seek(reclen * 1)                  # fetch record 2
>>> file.read(reclen)
'pppppppp'

>>> file = open('random.bin', 'rb')        # binary mode works the same here
>>> file.seek(reclen * 2)                  # fetch record 3
>>> file.read(reclen)                      # returns byte strings
b'YYYYYYYY'
But unless your file’s content is always a simple unencoded text form like ASCII and has no translated line-ends, text mode should not generally be used if you are going to seek—line-ends may be translated on Windows and Unicode encodings may make arbitrary transformations, both of which can make absolute seek offsets difficult to use. In the following, for example, the positions of characters after the first non-ASCII no longer match between the string in Python and its encoded representation on the file:
>>> data = 'sp\xe4m'                               # data to your script
>>> data, len(data)                                # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8'))  # bytes written to file
(b'sp\xc3\xa4m', 5)

>>> f = open('test', mode='w+', encoding='utf8')   # use text mode, encoded
>>> f.write(data)
>>> f.flush()

>>> f.seek(0); f.read(1)                           # ascii bytes work
's'
>>> f.seek(2); f.read(1)                           # as does 2-byte nonascii
'ä'
>>> data[3]                                        # but offset 3 is not 'm' !
'm'
>>> f.seek(3); f.read(1)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0:
unexpected code byte
As you can see, Python’s file modes provide flexible file
processing for programs that require it. In fact, the os
module offers even more file
processing options, as the next section describes.
Lower-Level File Tools in the os Module
The os
module contains an additional set of file-processing
functions that are distinct from the built-in file
object tools demonstrated in previous examples.
For instance, here is a partial list of os file-related calls:
os.open(path, flags, mode), which opens a file and returns its
descriptor; os.read(descriptor, N), which reads at most N bytes and
returns a byte string; os.write(descriptor, bytestring), which writes
the bytes in a byte string to the file; and os.lseek(descriptor,
position, how), which moves to a new position in the file.
Technically, os
calls
process files by their descriptors, which are
integer codes or “handles” that identify files in the operating
system. Descriptor-based files deal in raw bytes, and have no notion
of the line-end or Unicode translations for text that we studied in
the prior section. In fact, apart from extras like buffering,
descriptor-based files generally correspond to binary mode file
objects, and we similarly read and write bytes
strings, not str
strings. However, because the
descriptor-based file tools in os
are lower level and more complex than the built-in file objects
created with the built-in open
function, you should generally use the latter for all but very
special file-processing needs.[9]
Using os.open files
To give you the general flavor of this tool set, though, let’s run a few
interactive experiments. Although built-in file objects and
os
module descriptor files are
processed with distinct tool sets, they are in fact related—the
file system used by file objects simply adds a layer of logic on
top of descriptor-based files.
In fact, the fileno
file
object method returns the integer descriptor associated with a
built-in file object. For instance, the standard stream file
objects have descriptors 0, 1, and 2; calling the os.write
function to send data to
stdout
by descriptor has the
same effect as calling the sys.stdout.write
method:
>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print(stream.fileno())
...
0
1
2

>>> sys.stdout.write('Hello stdio world\n')       # write via file method
Hello stdio world
18

>>> import os
>>> os.write(1, b'Hello descriptor world\n')      # write via os module
Hello descriptor world
23
Because file objects we open explicitly behave the same way,
it’s also possible to process a given real external file on the
underlying computer through the built-in open
function, tools in the os
module, or both (some integer return
values are omitted here for brevity):
>>> file = open(r'C:\temp\spam.txt', 'w')         # create external file, object
>>> file.write('Hello stdio file\n')              # write via file object method
>>> file.flush()                                  # else os.write to disk first!
>>> fd = file.fileno()                            # get descriptor from object
>>> fd
3
>>> import os
>>> os.write(fd, b'Hello descriptor file\n')      # write via os module
>>> file.close()

C:\temp> type spam.txt                            # lines from both schemes
Hello stdio file
Hello descriptor file
os.open mode flags
So why the extra file tools in os
? In short, they give more low-level
control over file processing. The built-in open
function is easy to use, but it may
be limited by the underlying filesystem that it uses, and it adds
extra behavior that we do not want. The os
module lets scripts be more
specific—for example, the following opens a descriptor-based file
in read-write and binary modes by performing a binary “or” on two
mode flags exported by os
:
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> os.read(fdfile, 20)
b'Hello stdio file\r\nHe'
>>> os.lseek(fdfile, 0, 0)                 # go back to start of file
>>> os.read(fdfile, 100)                   # binary mode retains "\r\n"
b'Hello stdio file\r\nHello descriptor file\n'

>>> os.lseek(fdfile, 0, 0)
>>> os.write(fdfile, b'HELLO')             # overwrite first 5 bytes
5

C:\temp> type spam.txt
HELLO stdio file
Hello descriptor file
In this case, binary mode strings rb+
and r+b
in the basic open
call are equivalent:
>>> file = open(r'C:\temp\spam.txt', 'rb+')    # same but with open/objects
>>> file.read(20)
b'HELLO stdio file\r\nHe'
>>> file.seek(0)
>>> file.read(100)
b'HELLO stdio file\r\nHello descriptor file\n'

>>> file.seek(0)
>>> file.write(b'Jello')
5
>>> file.seek(0)
>>> file.read()
b'Jello stdio file\r\nHello descriptor file\n'
But on some systems, os.open
flags let us specify more
advanced things like exclusive access
(O_EXCL
) and
nonblocking modes (O_NONBLOCK
) when a file is opened. Some
of these flags are not portable across platforms (another reason
to use built-in file objects most of the time); see the library
manual or run a dir(os)
call on
your machine for an exhaustive list of other open flags
available.
One final note here: using os.open
with the O_EXCL
flag is the most portable way to
lock files for concurrent updates or other
process synchronization in Python today. We’ll see contexts where
this can matter in the next chapter, when we begin to explore
multiprocessing tools.
Programs running in parallel on a server machine, for instance,
may need to lock files before performing updates, if multiple
threads or processes might attempt such updates at the same
time.
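For a preview, here is a minimal sketch of the exclusive-creation technique: because O_EXCL makes os.open fail if the file already exists, only one process at a time can create the lock file (the lock filename and messages here are arbitrary):

import os

def acquire_lock(lockname='script.lock'):
    try:
        # creation is atomic: fails if another process made the file first
        return os.open(lockname, os.O_CREAT | os.O_EXCL | os.O_RDWR)
    except OSError:                       # EEXIST: lock is already held
        return None

fd = acquire_lock()
if fd is not None:
    print('lock acquired: safe to update shared files')
    os.close(fd)                          # release: close and remove the file
    os.remove('script.lock')
else:
    print('lock is busy: try again later')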
Wrapping descriptors in file objects
We saw earlier how to go from file object to file descriptor with the
fileno
file object method;
given a descriptor, we can use os
module tools for lower-level file
access to the underlying file. We can also go the other
way—the os.fdopen
call
wraps a file descriptor in a file object. Because conversions work
both ways, we can generally use either tool set—file object or
os
module:
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = os.fdopen(fdfile, 'rb')
>>> objfile.read()
b'Jello stdio file\r\nHello descriptor file\n'
In fact, we can wrap a file descriptor in either a binary or
text-mode file object: in text mode, reads and writes perform the
Unicode encodings and line-end translations we studied earlier and
deal in str
strings instead of
bytes
:
C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> objfile = os.fdopen(fdfile, 'r')
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'
In Python 3.X, the built-in open call also accepts a file descriptor instead of a file name string; in this mode it works much like os.fdopen, but gives you greater control—for example, you can use additional arguments to specify a nondefault Unicode encoding for text and suppress the default descriptor close. Really, though, os.fdopen accepts the same extra-control arguments in 3.X, because it has been redefined to do little but call back to the built-in open (see os.py in the standard library):
C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = open(fdfile, 'r', encoding='latin1', closefd=False)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'

>>> objfile = os.fdopen(fdfile, 'r', encoding='latin1', closefd=True)
>>> objfile.seek(0)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'
We’ll make use of this file object wrapper technique to
simplify text-oriented pipes and other descriptor-like objects
later in this book (e.g., sockets have a makefile
method which achieves similar
effects).
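For a quick preview of that technique, here is a small sketch of my own (not one of this chapter’s examples) that wraps the two raw descriptors returned by os.pipe in text-mode file objects:

import os

r, w = os.pipe()                  # two raw descriptors: read end, write end
reader = os.fdopen(r, 'r')        # wrap both in text-mode file objects
writer = os.fdopen(w, 'w')

writer.write('spam\n')            # str out, encoded and buffered by wrapper
writer.close()                    # flush and signal end-of-data
print(reader.readline())          # prints 'spam'
reader.close()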
Other os module file tools
The os module also includes an assortment of file tools that accept a file pathname string and accomplish file-related tasks such as renaming (os.rename), deleting (os.remove), and changing the file’s owner and permission settings (os.chown, os.chmod). Let’s step through a few examples of these tools in action:
>>> os.chmod('spam.txt', 0o777)                 # enable all accesses
This os.chmod file permissions call passes a 9-bit string composed of three sets of three bits each. From left to right, the three sets represent the file’s owning user, the file’s group, and all others. Within each set, the three bits reflect read, write, and execute access permissions. When a bit is “1” in this string, it means that the corresponding operation is allowed for the accessor. For instance, octal 0777 is a string of nine “1” bits in binary, so it enables all three kinds of accesses for all three user groups; octal 0600 means that the file can be read and written only by the user that owns it (when written in binary, 0600 octal is really bits 110 000 000).
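If the raw octal form is hard to read, the stat module we’ll meet in a moment exports symbolic constants that can be or’ed together into the same bitmasks—a small sketch equivalent to the octal calls:

import os, stat

# owner read+write, group read, others read: the same bits as octal 0o644
mode = stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH
os.chmod('spam.txt', mode)

os.chmod('spam.txt', 0o644)       # equivalent octal spelling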
This scheme stems from Unix file permission settings, but the call works on Windows as well. If it’s puzzling, see your system’s documentation (e.g., a Unix manpage) for chmod. Moving on:
>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')   # from, to
>>> os.remove(r'C:\temp\spam.txt')                        # delete file?
WindowsError: [Error 2] The system cannot find the file specified: 'C:\\temp\\...'
>>> os.remove(r'C:\temp\eggs.txt')
The os.rename
call used here changes a file’s name; the os.remove
file
deletion call deletes a file from your system and is synonymous
with os.unlink
(the
latter reflects the call’s name on Unix but was obscure to
users of other platforms).[10] The os
module
also exports the stat
system
call:
>>> open('spam.txt', 'w').write('Hello stat world\n')     # +1 for \r added
17
>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
nt.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=0, st_uid=0, st_gid=0,
st_size=18, st_atime=1267645806, st_mtime=1267646072, st_ctime=1267645806)
>>> info.st_mode, info.st_size                  # via named-tuple item attr names
(33206, 18)
>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]      # via stat module presets
(33206, 18)
>>> stat.S_ISDIR(info.st_mode), stat.S_ISREG(info.st_mode)
(False, True)
The os.stat
call returns a tuple of values (really, in 3.X a
special kind of tuple with named items) giving low-level
information about the named file, and the stat
module exports constants and
functions for querying this information in a portable way. For
instance, indexing an os.stat
result on offset stat.ST_SIZE
returns the file’s size, and calling stat.S_ISDIR
with the mode item from an
os.stat
result checks whether
the file is a directory. As shown earlier, though, both of these
operations are available in the os.path
module, too, so it’s rarely
necessary to use os.stat
except
for low-level file queries:
>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(False, True, 18)
File Scanners
Before we leave our file tools survey, it’s time for something that performs a more tangible task and illustrates some of what we’ve learned so far. Unlike some shell-tool languages, Python doesn’t have an implicit file-scanning loop procedure, but it’s simple to write a general one that we can reuse for all time. The module in Example 4-1 defines a general file-scanning routine, which simply applies a passed-in Python function to each line in an external file.
def scanner(name, function):
    file = open(name, 'r')                  # create a file object
    while True:
        line = file.readline()              # call file methods
        if not line: break                  # until end-of-file
        function(line)                      # call a function object
    file.close()
The scanner
function
doesn’t care what line-processing function is passed
in, and that accounts for most of its generality—it is happy to
apply any single-argument function that exists
now or in the future to all of the lines in a text file. If we code
this module and put it in a directory on the module search path, we
can use it any time we need to step through a file line by line.
Example 4-2 is a client
script that does simple line translations.
#!/usr/local/bin/python
from sys import argv
from scanfile import scanner

class UnknownCommand(Exception): pass

def processLine(line):                      # define a function
    if line[0] == '*':                      # applied to each line
        print("Ms.", line[1:-1])
    elif line[0] == '+':
        print("Mr.", line[1:-1])            # strip first and last char: \n
    else:
        raise UnknownCommand(line)          # raise an exception

filename = 'data.txt'
if len(argv) == 2: filename = argv[1]       # allow filename cmd arg
scanner(filename, processLine)              # start the scanner
The text file hillbillies.txt contains the following lines:
*Granny
+Jethro
*Elly May
+"Uncle Jed"
and our commands script could be run as follows:
C:\...\PP4E\System\Filetools> python commands.py hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly May
Mr. "Uncle Jed"
This works, but there are a variety of coding alternatives for
both files, some of which may be better than those listed above. For
instance, we could also code the command processor of Example 4-2 in the following way;
especially if the number of command options starts to become large,
such a data-driven approach may be more concise and easier to maintain than a large
if
statement with essentially
redundant actions (if you ever have to change the way output lines
print, you’ll have to change it in only one place with this
form):
commands = {'*': 'Ms.', '+': 'Mr.'}         # data is easier to expand than code?

def processLine(line):
    try:
        print(commands[line[0]], line[1:-1])
    except KeyError:
        raise UnknownCommand(line)
The scanner could similarly be improved. As a rule of thumb,
we can also usually speed things up by shifting processing from
Python code to built-in tools. For instance, if we’re concerned with
speed, we can probably make our file scanner faster by using the
file’s line iterator to step through the file
instead of the manual readline
loop in Example 4-1 (though
you’d have to time this with your Python to be sure):
def scanner(name, function):
    for line in open(name, 'r'):            # scan line by line
        function(line)                      # call a function object
And we can work more magic in Example 4-1 with the iteration
tools like the map
built-in
function, the list comprehension expression, and the generator
expression. Here are three minimalist’s versions; the for
loop is replaced by map
or a comprehension, and we let Python
close the file for us when it is garbage collected or the script
exits (these all build a temporary list of results along the way to
run through their iterations, but this overhead is likely trivial
for all but the largest of files):
def scanner(name, function):
    list(map(function, open(name, 'r')))

def scanner(name, function):
    [function(line) for line in open(name, 'r')]

def scanner(name, function):
    list(function(line) for line in open(name, 'r'))
File filters
The preceding works as planned, but what if we also want to change a file while scanning it? Example 4-3 shows two approaches: one uses explicit files, and the other uses the standard input/output streams to allow for redirection on the command line.
import sys

def filter_files(name, function):           # filter file through function
    input = open(name, 'r')                 # create file objects
    output = open(name + '.out', 'w')       # explicit output file too
    for line in input:
        output.write(function(line))        # write the modified line
    input.close()
    output.close()                          # output has a '.out' suffix

def filter_stream(function):                # no explicit files
    while True:                             # use standard streams
        line = sys.stdin.readline()         # or: input()
        if not line: break
        print(function(line), end='')       # or: sys.stdout.write()

if __name__ == '__main__':
    filter_stream(lambda line: line)        # copy stdin to stdout if run
Notice that the newer context managers feature discussed earlier could save us a few lines here in the file-based filter of Example 4-3, and also guarantee immediate file closures if the processing function fails with an exception:
def filter_files(name, function):
    with open(name, 'r') as input, open(name + '.out', 'w') as output:
        for line in input:
            output.write(function(line))    # write the modified line
And again, file object line iterators could simplify the stream-based filter’s code in this example as well:
def filter_stream(function):
    for line in sys.stdin:                  # read by lines automatically
        print(function(line), end='')
Since the standard streams are preopened for us, they’re often easier to use. When run standalone, it simply parrots stdin to stdout:
C:\...\PP4E\System\Filetools> filters.py < hillbillies.txt
*Granny
+Jethro
*Elly May
+"Uncle Jed"
But this module is also useful when imported as a library (clients provide the line-processing function):
>>> from filters import filter_files
>>> filter_files('hillbillies.txt', str.upper)
>>> print(open('hillbillies.txt.out').read())
*GRANNY
+JETHRO
*ELLY MAY
+"UNCLE JED"
We’ll see files in action often in the remainder of this book, especially in the more complete and functional system examples of Chapter 6. First though, we turn to tools for processing our files’ home.
Directory Tools
One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory—a “folder” in Windows-speak. By running a script on a batch of files, we can automate (that is, script) tasks we might have to otherwise run repeatedly by hand.
For instance, suppose you need to search all of your Python
files in a development directory for a global variable name (perhaps
you’ve forgotten where it is used). There are many platform-specific
ways to do this (e.g., the find
and
grep
commands in Unix), but Python
scripts that accomplish such tasks will work on every platform where
Python works—Windows, Unix, Linux, Macintosh, and just about any other
platform commonly used today. If you simply copy your script to any
machine you wish to use it on, it will work regardless of which other
tools are available there; all you need is Python. Moreover, coding
such tasks in Python also allows you to perform arbitrary actions
along the way—replacements, deletions, and whatever else you can code
in the Python language.
Walking One Directory
The most common way to go about writing such tools is to first grab a list of the names of the files you wish to process, and then step through that list with a Python for loop or other iteration tool, processing each file in turn. The trick we need to learn here, then, is how to get such a directory list within our scripts. For scanning directories there are at least three options: running shell listing commands with os.popen, matching filename patterns with glob.glob, and getting directory listings with os.listdir. They vary in interface, result format, and portability.
Running shell listing commands with os.popen
How did you go about getting directory file listings before you heard of Python? If you’re new to shell tools programming, the answer may be “Well, I started a Windows file explorer and clicked on things,” but I’m thinking here in terms of less GUI-oriented command-line mechanisms.
On Unix, directory listings are usually obtained by typing
ls
in a shell; on Windows, they
can be generated with a dir
command typed in an MS-DOS console box. Because Python scripts may
use os.popen
to run any command
line that we can type in a shell, they are the most general way to
grab a directory listing inside a Python program. We met os.popen
in the prior chapters; it runs
a shell command string and gives us a file object from which we
can read the command’s output. To illustrate, let’s first assume
the following directory structures—I have both the usual dir
and a Unix-like ls
command from Cygwin on my Windows
laptop:
c:\temp> dir /B
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt

c:\temp> c:\cygwin\bin\ls
PP3E  parts  random.bin  spam.txt  temp.bin  temp.txt

c:\temp> c:\cygwin\bin\ls parts
part0001  part0002  part0003  part0004
The parts and PP3E names are nested subdirectories in C:\temp here (the latter is a copy of the prior edition’s examples tree, which I used occasionally in this text). Now, as we’ve seen, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line and reading its output (the text normally thrown up on the console window):
C:\temp> python
>>> import os
>>> os.popen('dir /B').readlines()
['parts\n', 'PP3E\n', 'random.bin\n', 'spam.txt\n', 'temp.bin\n', 'temp.txt\n']
Lines read from a shell command come back with a trailing
end-of-line character, but it’s easy enough to slice it off; the
os.popen
result also gives us a
line iterator just like normal files:
>>> for line in os.popen('dir /B'):
...     print(line[:-1])
...
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt

>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
For pipe objects, the effect of iterators may be even more
useful than simply avoiding loading the entire result into memory
all at once: readlines
will
always block the caller until the spawned program is completely
finished, whereas the iterator might not.
The dir
and ls
commands let us be specific about filename patterns
to be matched and directory names to be listed by using name
patterns; again, we’re just running shell commands here, so
anything you can type at a shell prompt goes:
>>> os.popen('dir *.bin /B').readlines()
['random.bin\n', 'temp.bin\n']
>>> os.popen(r'c:\cygwin\bin\ls *.bin').readlines()
['random.bin\n', 'temp.bin\n']
>>> list(os.popen(r'dir parts /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> [fname for fname in os.popen(r'c:\cygwin\bin\ls parts')]
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
These calls use general tools and work as advertised. As I
noted earlier, though, the downsides of os.popen
are that it requires using a
platform-specific shell command and it incurs a performance hit to
start up an independent program. In fact, different listing tools
may sometimes produce different results:
>>> list(os.popen(r'dir parts\part* /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> list(os.popen(r'c:\cygwin\bin\ls parts/part*'))
['parts/part0001\n', 'parts/part0002\n', 'parts/part0003\n', 'parts/part0004\n']
The next two alternative techniques do better on both counts.
The glob module
The term globbing comes from the *
wildcard
character in filename patterns; per computing folklore, a *
matches a “glob” of characters. In
less poetic terms, globbing simply means collecting the names of
all entries in a directory—files and subdirectories—whose names
match a given filename pattern. In Unix shells, globbing expands
filename patterns within a command line into all matching
filenames before the command is ever run. In Python, we can do
something similar by calling the glob.glob
built-in—a tool that accepts a filename pattern to expand, and
returns a list (not a generator) of matching file names:
>>> import glob
>>> glob.glob('*')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> glob.glob('*.bin')
['random.bin', 'temp.bin']
>>> glob.glob('parts')
['parts']
>>> glob.glob('parts/*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
>>> glob.glob('parts\part*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
The glob call accepts the usual filename pattern syntax used in shells: ? means any one character, * means any number of characters, and [] is a character selection set.[11] The pattern should include a directory path if you wish to glob in something other than the current working directory, and the module accepts either Unix or DOS-style directory separators (/ or \). This call is implemented without spawning a shell command (it uses os.listdir, described in the next section) and so is likely to be faster and more portable and uniform across all Python platforms than the os.popen schemes shown earlier.
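As an aside, when a pattern may match very many names, the companion glob.iglob call returns an iterator that yields one matching name at a time rather than building the whole list; run in the same directory as above, it might look like this:

>>> import glob
>>> for name in glob.iglob('*.bin'):       # yields names on demand
...     print(name)
...
random.bin
temp.bin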
Technically speaking, glob
is a bit more powerful than
described so far. In fact, using it to list files in one directory
is just one use of its pattern-matching skills. For instance, it
can also be used to collect matching names across multiple
directories, simply because each level in a passed-in directory
path can be a pattern too:
>>> for path in glob.glob(r'PP3E\Examples\PP3E\*\s*.py'): print(path)
...
PP3E\Examples\PP3E\Lang\summer-alt.py
PP3E\Examples\PP3E\Lang\summer.py
PP3E\Examples\PP3E\PyTools\search_all.py
Here, we get back filenames from two different directories
that match the s*.py
pattern;
because the directory name preceding this is a *
wildcard, Python collects all possible
ways to reach the base filenames. Using os.popen
to spawn shell commands
achieves the same effect, but only if the underlying shell or
listing command does, too, and with possibly different result
formats across tools and platforms.
The os.listdir call
The os
module’s listdir
call provides yet another way to collect filenames
in a Python list. It takes a simple directory name string, not a
filename pattern, and returns a list containing the names of all
entries in that directory—both simple files and nested directories—for use in the calling
script:
>>> import os
>>> os.listdir('.')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> os.listdir(os.curdir)
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']
This, too, is done without resorting to shell commands and
so is both fast and portable to all major Python platforms. The
result is not in any particular order across platforms (but can be
sorted with the list sort
method or sorted
built-in
function); returns base filenames without their directory path
prefixes; does not include names “.” or “..” if present; and
includes names of both files and directories at the listed
level.
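For instance, a sorted wrapper imposes a platform-independent order; note that uppercase names sort before lowercase in the default string ordering:

>>> sorted(os.listdir('.'))
['PP3E', 'parts', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']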
To compare all three listing techniques, let’s run them here
side by side on an explicit directory. They differ in some ways
but are mostly just variations on a theme for this task—os.popen
returns end-of-lines and may
sort filenames on some platforms, glob.glob
accepts a pattern and returns
filenames with directory prefixes, and os.listdir
takes a simple directory name
and returns names without directory prefixes:
>>> os.popen('dir /b parts').readlines()
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> glob.glob(r'parts\*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']
Of these three, glob
and
listdir
are generally better
options if you care about script portability and result
uniformity, and listdir
seems
fastest in recent Python releases (but gauge its performance
yourself—implementations may change over time).
Splitting and joining listing results
In the last example, I pointed out that glob
returns names with directory paths,
whereas listdir
gives raw base
filenames. For convenient processing, scripts often need to split
glob
results into base files or
expand listdir
results into
full paths. Such translations are easy if we let the os.path
module do all the work for us.
For example, a script that intends to copy all files elsewhere
will typically need to first split off the base filenames from
glob
results so that it can add
different directory names on the front:
>>> dirname = r'C:\temp\parts'
>>> import glob
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print(head, tail, '=>', ('C:\\Other\\' + tail))
...
C:\temp\parts part0001 => C:\Other\part0001
C:\temp\parts part0002 => C:\Other\part0002
C:\temp\parts part0003 => C:\Other\part0003
C:\temp\parts part0004 => C:\Other\part0004
Here, the names after the =>
represent names that files might
be moved to. Conversely, a script that means to process all files
in a different directory than the one it runs in will probably
need to prepend listdir
results
with the target directory name before passing filenames on to
other tools:
>>> import os
>>> for file in os.listdir(dirname):
...     print(dirname, file, '=>', os.path.join(dirname, file))
...
C:\temp\parts part0001 => C:\temp\parts\part0001
C:\temp\parts part0002 => C:\temp\parts\part0002
C:\temp\parts part0003 => C:\temp\parts\part0003
C:\temp\parts part0004 => C:\temp\parts\part0004
When you begin writing realistic directory processing tools of the sort we’ll develop in Chapter 6, you’ll find these calls to be almost habit.
Walking Directory Trees
You may have noticed that almost all of the techniques in this section so far return the names of files in only a single directory (globbing with more involved patterns is the only exception). That’s fine for many tasks, but what if you want to apply an operation to every file in every directory and subdirectory in an entire directory tree?
For instance, suppose again that we need to find every occurrence of a global name in our Python scripts. This time, though, our scripts are arranged into a module package: a directory with nested subdirectories, which may have subdirectories of their own. We could rerun our hypothetical single-directory searcher manually in every directory in the tree, but that’s tedious, error prone, and just plain not fun.
Luckily, in Python it’s almost as easy to process a directory
tree as it is to inspect a single directory. We can either write a
recursive routine to traverse the tree, or use a tree-walker utility
built into the os
module. Such
tools can be used to search, copy, compare, and otherwise process
arbitrary directory trees on any platform that Python runs on (and
that’s just about everywhere).
The os.walk visitor
To make it easy to apply an operation to all files in a complete
directory tree, Python comes with a utility that scans trees for
us and runs code we provide at every directory along the way: the
os.walk
function is called with
a directory root name and automatically walks the entire tree at
root and below.
Operationally, os.walk
is
a generator function—at each
directory in the tree, it yields a three-item tuple, containing
the name of the current directory as well as lists of both all the
files and all the subdirectories in the current directory. Because
it’s a generator, its walk is usually run by a for
loop (or other iteration tool); on
each iteration, the walker advances to the next subdirectory, and
the loop runs its code for the next level of the tree (for
instance, opening and searching all the files at that
level).
That description might sound complex the first time you hear it, but os.walk is fairly straightforward once you get the hang of it. In the following, for example, the loop body’s code is run for each directory in the tree rooted at the current working directory (.). Along the way, the loop simply prints the directory name and all the files at the current level after prepending the directory name. It’s simpler in Python than in English (I removed the PP3E subdirectory for this test to keep the output short):
>>> import os
>>> for (dirname, subshere, fileshere) in os.walk('.'):
...     print('[' + dirname + ']')
...     for fname in fileshere:
...         print(os.path.join(dirname, fname))     # handle one file
...
[.]
.\random.bin
.\spam.txt
.\temp.bin
.\temp.txt
[.\parts]
.\parts\part0001
.\parts\part0002
.\parts\part0003
.\parts\part0004
In other words, we’ve coded our own custom and easily changed recursive directory listing tool in Python. Because this may be something we would like to tweak and reuse elsewhere, let’s make it permanently available in a module file, as shown in Example 4-4, now that we’ve worked out the details interactively.
"list file tree with os.walk" import sys, os def lister(root): # for a root dir for (thisdir, subshere, fileshere) in os.walk(root): # generate dirs in tree print('[' + thisdir + ']') for fname in fileshere: # print files in this dir path = os.path.join(thisdir, fname) # add dir name prefix print(path) if __name__ == '__main__': lister(sys.argv[1]) # dir name in cmdline
When packaged this way, the code can also be run from a shell command line. Here it is being launched with the root directory to be listed passed in as a command-line argument:
C:\...\PP4E\System\Filetools> python lister_walk.py C:\temp\test
[C:\temp\test]
C:\temp\test\random.bin
C:\temp\test\spam.txt
C:\temp\test\temp.bin
C:\temp\test\temp.txt
[C:\temp\test\parts]
C:\temp\test\parts\part0001
C:\temp\test\parts\part0002
C:\temp\test\parts\part0003
C:\temp\test\parts\part0004
Here’s a more involved example of os.walk
in action. Suppose you have a
directory tree of files and you want to find all Python source
files within it that reference the mimetypes
module we’ll study in Chapter 6. The following is one
(albeit hardcoded and overly specific) way to accomplish this
task:
>>> import os
>>> matches = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
...     for filename in fileshere:
...         if filename.endswith('.py'):
...             pathname = os.path.join(dirname, filename)
...             if 'mimetypes' in open(pathname).read():
...                 matches.append(pathname)
...
>>> for name in matches: print(name)
...
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py
This code loops through all the files at each level, looking for files whose names end in .py and whose content contains the search string. When a match is found, its full name is appended to the results list object; alternatively, we could also simply build a list of all .py files and search each in a for loop after the walk (see the sketch below). Since we’re going to code a much more general solution to this type of problem in Chapter 6, though, we’ll let this stand for now.
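The collect-first alternative just mentioned might be sketched as follows—gather the .py pathnames during the walk, and defer the content test to a second pass:

import os

pyfiles = []
for (dirname, dirshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
    for filename in fileshere:
        if filename.endswith('.py'):
            pyfiles.append(os.path.join(dirname, filename))   # collect now

matches = [name for name in pyfiles
                if 'mimetypes' in open(name).read()]          # search later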
If you want to see what’s really going on in the os.walk
generator, call its __next__
method (or equivalently, pass
it to the next
built-in
function) manually a few times, just as the for
loop does automatically; each time,
you advance to the next subdirectory in the tree:
>>> gen = os.walk(r'C:\temp\test')
>>> gen.__next__()
('C:\\temp\\test', ['parts'], ['random.bin', 'spam.txt', 'temp.bin', 'temp.txt'])
>>> gen.__next__()
('C:\\temp\\test\\parts', [], ['part0001', 'part0002', 'part0003', 'part0004'])
>>> gen.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
The library manual documents os.walk
further than we will here. For
instance, it supports bottom-up instead of top-down walks with its
optional topdown=False
argument, and callers may prune tree branches by deleting names in
the subdirectories lists of the yielded tuples.
Internally, the os.walk
call generates filename lists at each level with the os.listdir
call we met earlier, which
collects both file and directory names in no particular order and
returns them without their directory paths; os.walk
segregates this list into
subdirectories and files (technically, nondirectories) before
yielding a result. Also note that walk
uses the very same subdirectories
list it yields to callers in order to later descend into
subdirectories. Because lists are mutable objects that can be
changed in place, if your code modifies the yielded subdirectory
names list, it will impact what walk
does next. For example, deleting
directory names will prune traversal branches, and sorting the
list will order the walk.
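To make the pruning idea concrete, here is a minimal sketch of my own that skips any subdirectory named __pycache__ (an arbitrary example name) by rebinding the yielded list in place:

import os

for (dirname, subshere, fileshere) in os.walk('.'):
    # in-place slice assignment: walk will not descend into removed names
    subshere[:] = [sub for sub in subshere if sub != '__pycache__']
    subshere.sort()                            # and order the descent
    for fname in fileshere:
        print(os.path.join(dirname, fname))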
Recursive os.listdir traversals
The os.walk
tool
does the work of tree traversals for us; we simply
provide loop code with task-specific logic. However, it’s
sometimes more flexible and hardly any more work to do the walking
ourselves. The following script recodes the directory listing
script with a manual recursive traversal
function (a function that calls itself to repeat its actions). The
mylister
function in Example 4-5 is almost the same
as lister
in Example 4-4 but calls os.listdir
to generate file paths
manually and calls itself recursively to descend into
subdirectories.
# list files in dir tree by recursion

import sys, os

def mylister(currdir):
    print('[' + currdir + ']')
    for file in os.listdir(currdir):            # list files here
        path = os.path.join(currdir, file)      # add dir path back
        if not os.path.isdir(path):
            print(path)
        else:
            mylister(path)                      # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                       # dir name in cmdline
As usual, this file can be both imported and called or run as a script, though the fact that its result is printed text makes it less useful as an imported component unless its output stream is captured by another program.
When run as a script, this file’s output is equivalent to that of Example 4-4, but not identical—unlike the os.walk version, our recursive walker here doesn’t order the walk to visit files before stepping into subdirectories. It could be made to, by looping through the filenames list twice (selecting files first), as the sketch after this listing shows, but as coded, the order is dependent on os.listdir results. For most use cases, the walk order would be irrelevant:
C:\...\PP4E\System\Filetools> python lister_recur.py C:\temp\test
[C:\temp\test]
[C:\temp\test\parts]
C:\temp\test\parts\part0001
C:\temp\test\parts\part0002
C:\temp\test\parts\part0003
C:\temp\test\parts\part0004
C:\temp\test\random.bin
C:\temp\test\spam.txt
C:\temp\test\temp.bin
C:\temp\test\temp.txt
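For reference, the two-pass variant alluded to above might be coded as follows; this is a sketch, not one of this chapter’s numbered examples:

# list files in dir tree by recursion, files before subdirectories

import sys, os

def mylister(currdir):
    print('[' + currdir + ']')
    names = os.listdir(currdir)
    for name in names:                          # pass 1: files at this level
        path = os.path.join(currdir, name)
        if not os.path.isdir(path):
            print(path)
    for name in names:                          # pass 2: then recur into subdirs
        path = os.path.join(currdir, name)
        if os.path.isdir(path):
            mylister(path)

if __name__ == '__main__':
    mylister(sys.argv[1])                       # dir name in cmdline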
We’ll make better use of most of this section’s techniques
in later examples in Chapter 6
and in this book at large. For example, scripts for copying and
comparing directory trees use the tree-walker techniques
introduced here. Watch for these tools in action along the way.
We’ll also code a find utility in Chapter 6 that combines the tree
traversal of os.walk
with the
filename pattern expansion of glob.glob
.
Handling Unicode Filenames in 3.X: listdir, walk, glob
Because all normal strings are Unicode in Python 3.X, the
directory and file names generated by os.listdir
, os.walk
, and glob.glob
so far in this chapter are
technically Unicode strings. This can have some ramifications if
your directories contain unusual names that might not decode
properly.
Technically, because filenames may contain arbitrary text, the os.listdir call works in two modes in 3.X: given a bytes argument, this function will return filenames as encoded byte strings; given a normal str string argument, it instead returns filenames as Unicode strings, decoded per the filesystem’s encoding scheme:
C:\...\PP4E\System\Filetools> python
>>> import os
>>> os.listdir('.')[:4]
['bigext-tree.py', 'bigpy-dir.py', 'bigpy-path.py', 'bigpy-tree.py']
>>> os.listdir(b'.')[:4]
[b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py', b'bigpy-tree.py']
The byte string version can be used if undecodable file names
may be present. Because os.walk
and glob.glob
both work by
calling os.listdir
internally,
they inherit this behavior by proxy. The os.walk
tree walker, for example, calls
os.listdir
at each directory
level; passing byte string arguments suppresses decoding and returns
byte string results:
>>> for (dir, subs, files) in os.walk('..'): print(dir)
...
..
..\Environment
..\Filetools
..\Processes

>>> for (dir, subs, files) in os.walk(b'..'): print(dir)
...
b'..'
b'..\\Environment'
b'..\\Filetools'
b'..\\Processes'
The glob.glob
tool
similarly calls os.listdir
internally before applying name patterns, and so also returns
undecoded byte string names for byte string arguments:
>>> glob.glob('.\*')[:3]
['.\\bigext-out.txt', '.\\bigext-tree.py', '.\\bigpy-dir.py']
>>> glob.glob(b'.\*')[:3]
[b'.\\bigext-out.txt', b'.\\bigext-tree.py', b'.\\bigpy-dir.py']
Given a normal string name (as a command-line argument, for example), you can force the issue by converting to byte strings with manual encoding to suppress decoding:
>>> name = '.'
>>> os.listdir(name.encode())[:4]
[b'bigext-out.txt', b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py']
The upshot is that if your directories may contain names which cannot be decoded according to the underlying platform’s Unicode encoding scheme, you may need to pass byte strings to these tools to avoid Unicode encoding errors. You’ll get byte strings back, which may be less readable if printed, but you’ll avoid errors while traversing directories and files.
This might be especially useful on systems that use simple encodings such as ASCII or Latin-1, but may contain files with arbitrarily encoded names from cross-machine copies, the Web, and so on. Depending upon context, exception handlers may be used to suppress some types of encoding errors as well.
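As a rough sketch of that exception-handler approach—assuming UTF-8 content here, and simply skipping files that fail to open or decode—a tree searcher might be coded like this:

import os

def search_tree(root, target):
    matches = []
    for (dirname, subshere, fileshere) in os.walk(root):
        for fname in fileshere:
            path = os.path.join(dirname, fname)
            try:
                text = open(path, encoding='utf-8').read()
            except (UnicodeDecodeError, IOError):
                continue                        # skip undecodable/unreadable
            if target in text:
                matches.append(path)
    return matches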
We’ll see an example of how this can matter in the first section of Chapter 6, where an undecodable directory name generates an error if printed during a full disk scan (although that specific error seems more related to printing than to decoding in general).
Note that the basic open built-in function allows the name of the file being opened to be passed as either Unicode str or raw bytes, too, though this is used only to name the file initially; the additional mode argument determines whether the file’s content is handled in text or binary modes. Passing a byte string filename allows you to name files with arbitrarily encoded names.
Unicode policies: File content versus file names
In fact, it’s important to keep in mind that there are two different Unicode concepts related to files: the encoding of file content and the encoding of file name. Python provides your platform’s defaults for these settings in two different attributes; on Windows 7:
>>> import sys
>>> sys.getdefaultencoding()          # file content encoding, platform default
'utf-8'
>>> sys.getfilesystemencoding()       # file name encoding, platform scheme
'mbcs'
These settings allow you to be explicit when needed—the
content encoding is used when data is read and written to the
file, and the name encoding is used when dealing with names prior
to transferring data. In addition, using bytes
for file name tools may work
around incompatibilities with the underlying file system’s scheme,
and opening files in binary mode can suppress Unicode decoding
errors for content.
As we’ve seen, though, opening text files in binary mode may also mean that the raw and still-encoded text will not match search strings as expected: search strings must also be byte strings encoded per a specific and possibly incompatible encoding scheme. In fact, this approach essentially mimics the behavior of text files in Python 2.X, and underscores why elevating Unicode in 3.X is generally desirable—such text files sometimes may appear to work even though they probably shouldn’t. On the other hand, opening text in binary mode to suppress Unicode content decoding and avoid decoding errors might still be useful if you do not wish to skip undecodable files and content is largely irrelevant.
As a rule of thumb, you should try to always provide an encoding name for text content if it might be outside the platform default, and you should rely on the default Unicode API for file names in most cases. Again, see Python’s manuals for more on the Unicode file name story than we have space to cover fully here, and see Learning Python, Fourth Edition, for more on Unicode in general.
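Restated in code, the first half of that rule of thumb simply means passing the encoding argument explicitly whenever the content’s encoding is known—a trivial sketch:

with open('data.txt', 'w', encoding='utf-8') as f:   # explicit content encoding
    f.write('sp\xe4m\n')

with open('data.txt', 'r', encoding='utf-8') as f:
    text = f.read()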
In Chapter 6, we’re going to put the tools we met in this chapter to realistic use. For example, we’ll apply file and directory tools to implement file splitters, testing systems, directory copies and compares, and a variety of utilities based on tree walking. We’ll find that Python’s directory tools we met here have an enabling quality that allows us to automate a large set of real-world tasks. First, though, Chapter 5 concludes our basic tool survey, by exploring another system topic that tends to weave its way into a wide variety of application domains—parallel processing in Python.
[9] For instance, to process pipes, described in Chapter 5. The Python os.pipe call returns two file descriptors, which can be processed with os module file tools or wrapped in a file object with os.fdopen. When used with descriptor-based file tools in os, pipes deal in byte strings, not text. Some device files may require lower-level control as well.
[10] For related tools, see also the shutil
module in Python’s standard
library; it has higher-level tools for copying and removing
files and more. We’ll also write directory compare, copy, and
search tools of our own in Chapter 6, after we’ve had a
chance to study the directory tools presented later in this
chapter.