Like most kids, mine spend a lot of time on the Internet. As far as I can tell, it’s the thing to do these days. Among this latest generation, computer geeks and gurus seem to be held in the same sort of esteem that rock stars once were by mine. When kids disappear into their rooms, chances are good that they are hacking on computers, not mastering guitar riffs. It’s probably healthier than some of the diversions of my own misspent youth, but that’s a topic for another kind of book.
But if you have teenage kids and computers, or know someone who does, you probably know that it’s not a bad idea to keep tabs on what those kids do on the Web. Type your favorite four-letter word in almost any web search engine and you’ll understand the concern -- it’s much better stuff than I could get during my teenage career. To sidestep the issue, only a few of the machines in my house have Internet feeds.
Now, while they’re on one of these machines, my kids download lots of games. To avoid infecting our Very Important Computers with viruses from public-domain games, though, my kids usually have to download games on a computer with an Internet feed, and transfer them to their own computers to install. The problem is that game files are not small; they are usually much too big to fit on a floppy (and burning a CD takes away valuable game playing time).
If all the machines in my house ran Linux, this would be a nonissue. There are standard command-line programs on Unix for chopping a file into pieces small enough to fit on a floppy (split), and others for putting the pieces back together to recreate the original file (cat). Because we have all sorts of different machines in the house, though, we needed a more portable solution.
Since all the computers in my house run Python, a simple portable Python script came to the rescue. The Python program in Example 4-1 distributes a single file’s contents among a set of part files, and stores those part files in a directory.
Example 4-1. PP2E\System\Filetools\split.py
#!/usr/bin/python
#########################################################
# split a file into a set of portions; join.py puts them
# back together; this is a customizable version of the
# standard unix split command-line utility; because it
# is written in Python, it also works on Windows and can
# be easily tweaked; because it exports a function, it
# can also be imported and reused in other applications;
#########################################################

import sys, os
kilobytes = 1024
megabytes = kilobytes * 1000
chunksize = int(1.4 * megabytes)                   # default: roughly a floppy

def split(fromfile, todir, chunksize=chunksize):
    if not os.path.exists(todir):                  # caller handles errors
        os.mkdir(todir)                            # make dir, read/write parts
    else:
        for fname in os.listdir(todir):            # delete any existing files
            os.remove(os.path.join(todir, fname))
    partnum = 0
    input = open(fromfile, 'rb')                   # use binary mode on Windows
    while 1:                                       # eof=empty string from read
        chunk = input.read(chunksize)              # get next part <= chunksize
        if not chunk: break
        partnum = partnum+1
        filename = os.path.join(todir, ('part%04d' % partnum))
        fileobj = open(filename, 'wb')
        fileobj.write(chunk)
        fileobj.close()                            # or simply open( ).write( )
    input.close()
    assert partnum <= 9999                         # join sort fails if 5 digits
    return partnum

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print 'Use: split.py [file-to-split target-dir [chunksize]]'
    else:
        if len(sys.argv) < 3:
            interactive = 1
            fromfile = raw_input('File to be split? ')       # input if clicked
            todir = raw_input('Directory to store part files? ')
        else:
            interactive = 0
            fromfile, todir = sys.argv[1:3]                  # args in cmdline
            if len(sys.argv) == 4: chunksize = int(sys.argv[3])
        absfrom, absto = map(os.path.abspath, [fromfile, todir])
        print 'Splitting', absfrom, 'to', absto, 'by', chunksize

        try:
            parts = split(fromfile, todir, chunksize)
        except:
            print 'Error during split:'
            print sys.exc_type, sys.exc_value
        else:
            print 'Split finished:', parts, 'parts are in', absto
        if interactive: raw_input('Press Enter key')         # pause if clicked
By default, this script splits the input file into chunks that are roughly the size of a floppy disk -- perfect for moving big files between electronically isolated machines. Most important, because this is all portable Python code, this script will run on just about any machine, even ones without a file splitter of their own. All it requires is an installed Python. Here it is at work splitting the Python 1.5.2 self-installer executable on Windows:
C:\temp>echo %X%                      shorthand shell variable
C:\PP2ndEd\examples\PP2E
C:\temp>ls -l py152.exe
-rwxrwxrwa 1 0 0 5028339 Apr 16 1999 py152.exe
C:\temp>python %X%\System\Filetools\split.py -help
Use: split.py [file-to-split target-dir [chunksize]]
C:\temp>python %X%\System\Filetools\split.py py152.exe pysplit
Splitting C:\temp\py152.exe to C:\temp\pysplit by 1433600
Split finished: 4 parts are in C:\temp\pysplit
C:\temp>ls -l pysplit
total 9821
-rwxrwxrwa 1 0 0 1433600 Sep 12 06:03 part0001
-rwxrwxrwa 1 0 0 1433600 Sep 12 06:03 part0002
-rwxrwxrwa 1 0 0 1433600 Sep 12 06:03 part0003
-rwxrwxrwa 1 0 0 727539 Sep 12 06:03 part0004
Each of these four generated part files represents one binary chunk of the file py152.exe, small enough to fit comfortably on a floppy disk. In fact, if you add the sizes of the generated part files given by the ls command, you’ll come up with 5,028,339 bytes -- exactly the same as the original file’s size. Before we see how to put these files back together again, let’s explore a few of the splitter script’s finer points.
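The size arithmetic above is easy to check by hand, but a few lines of code make it explicit. The following is only a sketch in present-day Python 3 (not the Python these scripts were written for); the function name part_sizes is hypothetical and is not part of split.py:

```python
def part_sizes(total_bytes, chunksize):
    """Return the sizes of the part files a fixed-size splitter would produce."""
    full_parts, remainder = divmod(total_bytes, chunksize)
    sizes = [chunksize] * full_parts
    if remainder:
        sizes.append(remainder)          # the last part holds whatever is left over
    return sizes

chunksize = 1433600                      # the script's default, as shown in its output
sizes = part_sizes(5028339, chunksize)   # size of py152.exe from the ls listing
print(len(sizes), sizes[-1], sum(sizes)) # 4 727539 5028339
```

Three full parts plus a 727,539-byte remainder account for every byte of the original file, which matches the listing.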
This script is designed to input its parameters in either interactive or command-line mode; it checks the number of command-line arguments to know in which mode it is being used. In command-line mode, you list the file to be split and the output directory on the command line, and can optionally override the default part file size with a third command-line argument.
In interactive mode, the script asks for a filename and output directory at the console window with raw_input, and pauses for a keypress at the end before exiting. This mode is nice when the program file is started by clicking on its icon -- on Windows, parameters are typed into a pop-up DOS box that doesn’t automatically disappear. The script also shows the absolute paths of its parameters (by running them through os.path.abspath) because they may not be obvious in interactive mode. We’ll see examples of other split modes at work in a moment.
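To make the mode-selection rule concrete, here is a rough Python 3 sketch of the same decision, pulled out into a function so it can be exercised without a console. The name get_params and its ask parameter are assumptions made for illustration; the script itself does this inline with raw_input:

```python
def get_params(argv, ask=input, default_chunksize=1433600):
    """Return (interactive, fromfile, todir, chunksize) the way split.py decides them."""
    if len(argv) < 3:
        # Too few arguments: assume the script was clicked, so prompt for inputs
        fromfile = ask('File to be split? ')
        todir = ask('Directory to store part files? ')
        return True, fromfile, todir, default_chunksize
    fromfile, todir = argv[1:3]
    # A third argument, if present, overrides the default part size
    chunksize = int(argv[3]) if len(argv) == 4 else default_chunksize
    return False, fromfile, todir, chunksize

print(get_params(['split.py', 'py152.exe', 'pysplit', '500000']))
```

Passing the prompt function in as an argument is a design choice of this sketch only; it lets a test supply canned answers instead of typing them.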
This code is careful to open both input and output files in binary mode (rb, wb), because it needs to portably handle things like executables and audio files, not just text. In Chapter 2, we learned that on Windows, text-mode files automatically map \r\n end-of-line sequences to \n on input, and map \n to \r\n on output. For true binary data, we really don’t want any \r characters in the data to go away when read, and we don’t want any superfluous \r characters to be added on output. Binary-mode files suppress this \r mapping when the script is run on Windows, and so avoid data corruption.
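The effect of the mode flag can be seen in a small experiment. This is a sketch in present-day Python 3, where text-mode input translates \r\n to \n by default on every platform (an assumption that differs from the Windows-only mapping described above); the file name sample.bin is made up:

```python
import os
import tempfile

data = b'line one\r\nline two\r\n'          # bytes that happen to contain \r\n pairs
path = os.path.join(tempfile.mkdtemp(), 'sample.bin')

with open(path, 'wb') as f:                 # binary mode: bytes are written untouched
    f.write(data)

with open(path, 'rb') as f:                 # binary mode: bytes come back untouched
    raw = f.read()

with open(path, 'r', encoding='ascii') as f:  # text mode: \r\n is mapped to \n
    text = f.read()

print(raw == data)                          # True
print(len(data), len(text))                 # 20 18: two \r characters went away
```

For an executable or audio file, losing those two bytes would be corruption, which is why both scripts stick to binary mode.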
This script also goes out of its way to manually close its files. For instance:
fileobj = open(partname, 'wb')
fileobj.write(chunk)
fileobj.close()
As we also saw in Chapter 2, these three lines can usually be replaced with this single line:
open(partname, 'wb').write(chunk)
This shorter form relies on the fact that the current Python implementation automatically closes files for you when file objects are reclaimed (i.e., when they are garbage collected, because there are no more references to the file object). In this line, the file object would be reclaimed immediately, because the open result is temporary in an expression, and is never referenced by a longer-lived name. The input file similarly is reclaimed when the split function exits.
As I was writing this chapter, though, there was some possibility that this automatic-close behavior might go away in the future.[32] Moreover, the JPython Java-based Python implementation does not reclaim unreferenced objects as immediately as the standard Python. If you care about the Java port (or one possible future), if your script may create many files in a short amount of time, and if it may run on a machine that limits the number of open files per program, then close manually. The close calls in this script have never been necessary for my purposes, but because the split function in this module is intended to be a general-purpose tool, it accommodates such worst-case scenarios.
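In later versions of Python, a third option sits between the manual close call and the one-liner: the with statement, which closes the file when its block exits no matter which implementation is running the code. This is not something the scripts in this chapter use (it did not exist when they were written); the sketch below assumes Python 3, and write_part is a hypothetical helper name:

```python
import os
import tempfile

def write_part(filename, chunk):
    """Write one chunk to filename; the file is closed when the with block exits."""
    with open(filename, 'wb') as fileobj:
        fileobj.write(chunk)
    return fileobj.closed        # closed here, without an explicit close() call

partname = os.path.join(tempfile.mkdtemp(), 'part0001')
print(write_part(partname, b'some bytes'))   # True
```

The close happens at a known point rather than whenever the object is reclaimed, so it does not depend on garbage-collection timing.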
Back to moving big files around the house. After downloading a big game program file, my kids generally run the previous splitter script by clicking on its name in Windows Explorer and typing filenames. After a split, they simply copy each part file onto its own floppy, walk the floppies upstairs, and recreate the split output directory on their target computer by copying files off the floppies. Finally, the script in Example 4-2 is clicked or otherwise run to put the parts back together.
Example 4-2. PP2E\System\Filetools\join.py
#!/usr/bin/python
##########################################################
# join all part files in a dir created by split.py.
# This is roughly like a 'cat fromdir/* > tofile' command
# on unix, but is a bit more portable and configurable,
# and exports the join operation as a reusable function.
# Relies on sort order of file names: must be same length.
# Could extend split/join to popup Tkinter file selectors.
##########################################################

import os, sys
readsize = 1024

def join(fromdir, tofile):
    output = open(tofile, 'wb')
    parts = os.listdir(fromdir)
    parts.sort()
    for filename in parts:
        filepath = os.path.join(fromdir, filename)
        fileobj = open(filepath, 'rb')
        while 1:
            filebytes = fileobj.read(readsize)
            if not filebytes: break
            output.write(filebytes)
        fileobj.close()
    output.close()

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print 'Use: join.py [from-dir-name to-file-name]'
    else:
        if len(sys.argv) != 3:
            interactive = 1
            fromdir = raw_input('Directory containing part files? ')
            tofile = raw_input('Name of file to be recreated? ')
        else:
            interactive = 0
            fromdir, tofile = sys.argv[1:]
        absfrom, absto = map(os.path.abspath, [fromdir, tofile])
        print 'Joining', absfrom, 'to make', absto

        try:
            join(fromdir, tofile)
        except:
            print 'Error joining files:'
            print sys.exc_type, sys.exc_value
        else:
            print 'Join complete: see', absto
        if interactive: raw_input('Press Enter key')   # pause if clicked
After running the join script, they still may need to run something like zip, gzip, or tar to unpack an archive file, unless it’s shipped as an executable;[33] but at least they’re much closer to seeing the Starship Enterprise spring into action. Here is a join in progress on Windows, combining the split files we made a moment ago:
C:\temp>python %X%\System\Filetools\join.py -help
Use: join.py [from-dir-name to-file-name]
C:\temp>python %X%\System\Filetools\join.py pysplit mypy152.exe
Joining C:\temp\pysplit to make C:\temp\mypy152.exe
Join complete: see C:\temp\mypy152.exe
C:\temp>ls -l mypy152.exe py152.exe
-rwxrwxrwa 1 0 0 5028339 Sep 12 06:05 mypy152.exe
-rwxrwxrwa 1 0 0 5028339 Apr 16 1999 py152.exe
C:\temp>fc /b mypy152.exe py152.exe
Comparing files mypy152.exe and py152.exe
FC: no differences encountered
The join script simply uses os.listdir to collect all the part files in a directory created by split, and sorts the filename list to put the parts back together in the correct order. We get back an exact byte-for-byte copy of the original file (proved by the DOS fc command above; use cmp on Unix).
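If neither fc nor cmp is handy, the same byte-for-byte check can be made from Python itself. The sketch below assumes Python 3 and uses the standard library’s filecmp module; the helper name same_bytes and the throwaway file names are made up for the demonstration:

```python
import filecmp
import os
import tempfile

def same_bytes(path1, path2):
    """True if the two files have identical contents."""
    # shallow=False forces a comparison of the contents, not just os.stat() data
    return filecmp.cmp(path1, path2, shallow=False)

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, 'original.bin')
rejoined = os.path.join(workdir, 'rejoined.bin')
damaged = os.path.join(workdir, 'damaged.bin')
for path, content in [(original, b'\x00\x01\r\n\xff'),
                      (rejoined, b'\x00\x01\r\n\xff'),
                      (damaged, b'\x00\x01\n\xff')]:
    with open(path, 'wb') as f:
        f.write(content)

print(same_bytes(original, rejoined))   # True
print(same_bytes(original, damaged))    # False
```

Like the scripts themselves, this check runs unchanged on any platform with Python installed.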
Some of this process is still manual, of course (I haven’t quite figured out how to script the “walk the floppies upstairs” bit yet), but the split and join scripts make it both quick and simple to move big files around. Because this script is also portable Python code, it runs on any platform we care to move split files to. For instance, it’s typical for my kids to download both Windows and Linux games; since this script runs on either platform, they’re covered.
Before we move on, there are a couple of details worth underscoring in the join script’s code. First of all, notice that this script deals with files in binary mode, but also reads each part file in blocks of 1K bytes each. In fact, the readsize setting here (the size of each block read from an input part file) has no relation to chunksize in split.py (the total size of each output part file). As we learned in Chapter 2, this script could instead read each part file all at once:
filebytes = open(filepath, 'rb').read()
output.write(filebytes)
The downside to this scheme is that it really does load all of a file into memory at once. For example, reading a 1.4M part file into memory all at once with the file object read method generates a 1.4M string in memory to hold the file’s bytes. Since split allows users to specify even larger chunk sizes, the join script plans for the worst and reads in terms of limited-size blocks. To be completely robust, the split script could read its input data in smaller chunks too, but this hasn’t become a concern in practice.
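For the curious, here is one way the splitter might read its input in small blocks as well. This is a sketch under stated assumptions rather than the chapter’s script: it is written for Python 3, the function name split_blockwise and its readsize parameter are hypothetical, and unlike split.py it does not clear old files out of the target directory:

```python
import os
import tempfile

def split_blockwise(fromfile, todir, chunksize=1433600, readsize=1024):
    """Split fromfile into parts of at most chunksize bytes,
    reading no more than readsize bytes at a time."""
    os.makedirs(todir, exist_ok=True)      # unlike split.py, old files are kept
    partnum = 0
    with open(fromfile, 'rb') as source:
        while True:
            block = source.read(min(readsize, chunksize))
            if not block:                  # empty result means end of file
                break
            partnum += 1
            partpath = os.path.join(todir, 'part%04d' % partnum)
            with open(partpath, 'wb') as part:
                written = 0
                while block:
                    part.write(block)
                    written += len(block)
                    remaining = chunksize - written
                    if remaining <= 0:     # this part is full; start the next one
                        break
                    block = source.read(min(readsize, remaining))
    return partnum

# Small demonstration on a throwaway 2,500-byte file
workdir = tempfile.mkdtemp()
bigfile = os.path.join(workdir, 'big.bin')
with open(bigfile, 'wb') as f:
    f.write(bytes(range(250)) * 10)
print(split_blockwise(bigfile, os.path.join(workdir, 'parts'),
                      chunksize=1000, readsize=300))   # 3
```

The extra bookkeeping is the price of bounded memory use: each part is still at most chunksize bytes, but no single read is larger than readsize.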
If you study this script’s code closely, you may also notice that the join scheme it uses relies completely on the sort order of filenames in the parts directory. Because it simply calls the list sort method on the filenames list returned by os.listdir, it implicitly requires that filenames have the same length and format when created by split. The splitter uses zero-padding notation in a string formatting expression ('part%04d') to make sure that filenames all have the same number of digits at the end (four), much like this list:
>>> list = ['xx008', 'xx010', 'xx006', 'xx009', 'xx011', 'xx111']
>>> list.sort()
>>> list
['xx006', 'xx008', 'xx009', 'xx010', 'xx011', 'xx111']
When sorted, the leading zero characters in small numbers guarantee that part files are ordered for joining correctly. Without the leading zeroes, join would fail whenever there were more than nine part files, because the first digit would dominate:
>>> list = ['xx8', 'xx10', 'xx6', 'xx9', 'xx11', 'xx111']
>>> list.sort()
>>> list
['xx10', 'xx11', 'xx111', 'xx6', 'xx8', 'xx9']
Because the list sort method accepts a comparison function as an argument, we could in principle strip off digits in filenames and sort numerically:
>>> list = ['xx8', 'xx10', 'xx6', 'xx9', 'xx11', 'xx111']
>>> list.sort(lambda x, y: cmp(int(x[2:]), int(y[2:])))
>>> list
['xx6', 'xx8', 'xx9', 'xx10', 'xx11', 'xx111']
But that still implies that filenames all must start with a substring of the same length, so this doesn’t quite remove the file naming dependency between the split and join scripts. Because these scripts are designed to be two steps of the same process, though, some dependencies between them seem reasonable.
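In Python 3, the list sort method no longer accepts a comparison function; it takes a key function instead. As a hedged sketch of the numeric-sort idea in that style, the hypothetical helper part_number below pulls the trailing digits off a name with a regular expression, so it does not care how long the prefix is:

```python
import re

def part_number(filename):
    """Return the trailing digits of filename as an integer (-1 if there are none)."""
    match = re.search(r'(\d+)$', filename)
    return int(match.group(1)) if match else -1   # names without digits sort first

names = ['xx8', 'xx10', 'xx6', 'xx9', 'xx11', 'xx111']
print(sorted(names, key=part_number))
# ['xx6', 'xx8', 'xx9', 'xx10', 'xx11', 'xx111']
```

This loosens the same-length requirement, but the two scripts still have to agree that part names end in the part number, so some dependency remains.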
Let’s run a few more experiments with these Python system utilities to demonstrate other usage modes. When run without full command-line arguments, both split and join are smart enough to input their parameters interactively. Here they are chopping and gluing the Python self-installer file on Windows again, with parameters typed in the DOS console window:
C:\temp>python %X%\System\Filetools\split.py
File to be split? py152.exe
Directory to store part files? splitout
Splitting C:\temp\py152.exe to C:\temp\splitout by 1433600
Split finished: 4 parts are in C:\temp\splitout
Press Enter key
C:\temp>python %X%\System\Filetools\join.py
Directory containing part files? splitout
Name of file to be recreated? newpy152.exe
Joining C:\temp\splitout to make C:\temp\newpy152.exe
Join complete: see C:\temp\newpy152.exe
Press Enter key
C:\temp>fc /B py152.exe newpy152.exe
Comparing files py152.exe and newpy152.exe
FC: no differences encountered
When these program files are double-clicked in a file explorer GUI, they work the same way (there usually are no command-line arguments when launched this way). In this mode, absolute path displays help clarify where files really are. Remember, the current working directory is the script’s home directory when clicked like this, so the name tempsplit actually maps to a source code directory; type a full path to make the split files show up somewhere else:
[in a popup DOS console box when split is clicked]
File to be split? c:\temp\py152.exe
Directory to store part files? tempsplit
Splitting c:\temp\py152.exe to C:\PP2ndEd\examples\PP2E\System\Filetools\tempsplit by 1433600
Split finished: 4 parts are in C:\PP2ndEd\examples\PP2E\System\Filetools\tempsplit
Press Enter key
[in a popup DOS console box when join is clicked]
Directory containing part files? tempsplit
Name of file to be recreated? c:\temp\morepy152.exe
Joining C:\PP2ndEd\examples\PP2E\System\Filetools\tempsplit to make c:\temp\morepy152.exe
Join complete: see c:\temp\morepy152.exe
Press Enter key
Because these scripts package their core logic up in functions, though, it’s just as easy to reuse their code by importing and calling from another Python component:
C:\temp>python
>>> from PP2E.System.Filetools.split import split
>>> from PP2E.System.Filetools.join import join
>>>
>>> numparts = split('py152.exe', 'calldir')
>>> numparts
4
>>> join('calldir', 'callpy152.exe')
>>>
>>> import os
>>> os.system(r'fc /B py152.exe callpy152.exe')
Comparing files py152.exe and callpy152.exe
FC: no differences encountered
0
A word about performance: All the split and join tests shown so far process a 5M file, but take at most one second of real wall-clock time to finish on my 300 MHz and 650 MHz Windows 98 laptop computers -- plenty fast for just about any use I could imagine. (They run even faster after Windows has cached information about the files involved.) Both scripts run just as fast for other reasonable part file sizes too; here is the splitter chopping up the file into 500,000- and 50,000-byte parts:
C:\temp>python %X%\System\Filetools\split.py py152.exe tempsplit 500000
Splitting C:\temp\py152.exe to C:\temp\tempsplit by 500000
Split finished: 11 parts are in C:\temp\tempsplit
C:\temp>ls -l tempsplit
total 9826
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0001
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0002
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0003
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0004
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0005
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0006
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0007
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0008
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0009
-rwxrwxrwa 1 0 0 500000 Sep 12 06:29 part0010
-rwxrwxrwa 1 0 0 28339 Sep 12 06:29 part0011
C:\temp>python %X%\System\Filetools\split.py py152.exe tempsplit 50000
Splitting C:\temp\py152.exe to C:\temp\tempsplit by 50000
Split finished: 101 parts are in C:\temp\tempsplit
C:\temp>ls tempsplit
part0001 part0014 part0027 part0040 part0053 part0066 part0079 part0092
part0002 part0015 part0028 part0041 part0054 part0067 part0080 part0093
part0003 part0016 part0029 part0042 part0055 part0068 part0081 part0094
part0004 part0017 part0030 part0043 part0056 part0069 part0082 part0095
part0005 part0018 part0031 part0044 part0057 part0070 part0083 part0096
part0006 part0019 part0032 part0045 part0058 part0071 part0084 part0097
part0007 part0020 part0033 part0046 part0059 part0072 part0085 part0098
part0008 part0021 part0034 part0047 part0060 part0073 part0086 part0099
part0009 part0022 part0035 part0048 part0061 part0074 part0087 part0100
part0010 part0023 part0036 part0049 part0062 part0075 part0088 part0101
part0011 part0024 part0037 part0050 part0063 part0076 part0089
part0012 part0025 part0038 part0051 part0064 part0077 part0090
part0013 part0026 part0039 part0052 part0065 part0078 part0091
Split can take longer to finish, but only if the part file’s size is set small enough to generate thousands of part files -- splitting into 1006 parts works, but runs slower (on my computer this split and join take about five and two seconds, respectively, depending on what other programs are open):
C:\temp>python %X%\System\Filetools\split.py py152.exe tempsplit 5000
Splitting C:\temp\py152.exe to C:\temp\tempsplit by 5000
Split finished: 1006 parts are in C:\temp\tempsplit
C:\temp>python %X%\System\Filetools\join.py tempsplit mypy152.exe
Joining C:\temp\tempsplit to make C:\temp\mypy152.exe
Join complete: see C:\temp\mypy152.exe
C:\temp>fc /B py152.exe mypy152.exe
Comparing files py152.exe and mypy152.exe
FC: no differences encountered
C:\temp>ls -l tempsplit
...1000 lines deleted...
-rwxrwxrwa 1 0 0 5000 Sep 12 06:30 part1001
-rwxrwxrwa 1 0 0 5000 Sep 12 06:30 part1002
-rwxrwxrwa 1 0 0 5000 Sep 12 06:30 part1003
-rwxrwxrwa 1 0 0 5000 Sep 12 06:30 part1004
-rwxrwxrwa 1 0 0 5000 Sep 12 06:30 part1005
-rwxrwxrwa 1 0 0 3339 Sep 12 06:30 part1006
Finally, the splitter is also smart enough to create the output directory if it doesn’t yet exist, or clear out any old files there if it does exist. Because the joiner combines whatever files exist in the output directory, this is a nice ergonomic touch -- if the output directory were not cleared before each split, it would be too easy to forget that a prior run’s files are still there. Given that my kids are running these scripts, they need to be as forgiving as possible; your user base may vary, but probably not by much.
C:\temp>python %X%\System\Filetools\split.py py152.exe tempsplit 700000
Splitting C:\temp\py152.exe to C:\temp\tempsplit by 700000
Split finished: 8 parts are in C:\temp\tempsplit
C:\temp>ls -l tempsplit
total 9827
-rwxrwxrwa 1 0 0 700000 Sep 12 06:32 part0001
-rwxrwxrwa 1 0 0 700000 Sep 12 06:32 part0002
-rwxrwxrwa 1 0 0 700000 Sep 12 06:32 part0003
...
...only new files here...
...
-rwxrwxrwa 1 0 0 700000 Sep 12 06:32 part0006
-rwxrwxrwa 1 0 0 700000 Sep 12 06:32 part0007
-rwxrwxrwa 1 0 0 128339 Sep 12 06:32 part0008
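The create-or-clear step behind this behavior is small enough to isolate. The following Python 3 sketch mirrors the logic at the top of the split function; the name prepare_dir and its return value (a count of deleted files) are additions made for illustration. Like the original, it assumes the directory holds only plain files, since os.remove does not delete subdirectories:

```python
import os
import tempfile

def prepare_dir(todir):
    """Create todir if it is missing; otherwise delete the files already in it.
    Returns the number of old files removed."""
    if not os.path.exists(todir):
        os.mkdir(todir)
        return 0
    removed = 0
    for fname in os.listdir(todir):
        os.remove(os.path.join(todir, fname))   # fails on subdirectories
        removed += 1
    return removed

target = os.path.join(tempfile.mkdtemp(), 'tempsplit')
print(prepare_dir(target))                      # 0: directory was just created
for name in ('part0001', 'part0002'):
    open(os.path.join(target, name), 'wb').close()
print(prepare_dir(target))                      # 2: a prior run's parts are gone
```

Clearing the directory before each split is what keeps the joiner from picking up stale parts left over from an earlier run with a different chunk size.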
[32] I hope this doesn’t happen -- such a change would be a major break from backward compatibility, and could impact Python systems all over the world. On the other hand, it’s just a possibility for a future mutation of Python. I’m told that publishers of technical books love language changes, and this isn’t a text on politics.
[33] See also the built-in module gzip.py in the Python standard library; it provides tools for reading and writing gzip files, usually named with a .gz filename extension. It can be used to unpack gzipped files, and serves as an all-Python equivalent of the standard gzip and gunzip command-line utility programs. This built-in module uses another called zlib that implements gzip-compatible data compression. In Python 2.0, see also the new zipfile module for handling ZIP format archives (different from gzip).