Credit: Alex Martelli, Magnus Lie Hetland
You need to read a file paragraph by paragraph, in which a paragraph is defined as a sequence of nonempty lines (in other words, paragraphs are separated by empty lines).
A wrapper class is, as usual, the right Pythonic architecture for this (in Python 2.1 and earlier):
class Paragraphs:

    def __init__(self, fileobj, separator='\n'):
        # Ensure that we get a line-reading sequence in the best way possible:
        import xreadlines
        try:
            # Check if the file-like object has an xreadlines method
            self.seq = fileobj.xreadlines()
        except AttributeError:
            # No, so fall back to the xreadlines module's implementation
            self.seq = xreadlines.xreadlines(fileobj)
        self.line_num = 0    # current index into self.seq (line number)
        self.para_num = 0    # current index into self (paragraph number)
        # Ensure that separator string includes a line-end character at the end
        if separator[-1:] != '\n': separator += '\n'
        self.separator = separator

    def __getitem__(self, index):
        if index != self.para_num:
            raise TypeError, "Only sequential access supported"
        self.para_num += 1
        # Start where we left off and skip 0+ separator lines
        while 1:
            # Propagate IndexError, if any, since we're finished if it occurs
            line = self.seq[self.line_num]
            self.line_num += 1
            if line != self.separator: break
        # Accumulate 1+ nonempty lines into result
        result = [line]
        while 1:
            # Intercept IndexError, since we have one last paragraph to return
            try:
                # Let's check if there's at least one more line in self.seq
                line = self.seq[self.line_num]
            except IndexError:
                # self.seq is finished, so we exit the loop
                break
            # Increment index into self.seq for next time
            self.line_num += 1
            if line == self.separator: break
            result.append(line)
        return ''.join(result)

# Here's an example function, showing how to use class Paragraphs:
def show_paragraphs(filename, numpars=5):
    pp = Paragraphs(open(filename))
    for p in pp:
        print "Par#%d, line# %d: %s" % (
            pp.para_num, pp.line_num, repr(p))
        if pp.para_num>numpars: break
Python doesn’t directly support paragraph-oriented file reading, but, as usual, it’s not hard to add such functionality. We define a paragraph as a string formed by joining a nonempty sequence of nonseparator lines, separated from any adjoining paragraphs by nonempty sequences of separator lines. By default, a separator line is one that equals '\n' (an empty line), although this concept is easy to generalize. We let the client code determine what a separator is when instantiating this class. Any string is acceptable, but we append a '\n' to it if it doesn’t already end with '\n' (since we read the underlying file line by line, a separator not ending with '\n' would never match).
We can get even more generality by having the client code pass us a callable that looks at any line and tells us whether that line is a separator or not. In fact, this is how I originally architected this recipe, but then I decided that such an architecture represented a typical, avoidable case of overgeneralization (also known as overengineering and “Big Design Up Front”; see http://xp.c2.com/BigDesignUpFront.html), so I backtracked to the current, more reasonable amount of generality. Indeed, another reasonable design choice for this recipe’s class would be to completely forgo the customizability of what lines are to be considered separators and just test for separator lines with line.isspace(), so that stray blanks on an empty-looking line wouldn’t misleadingly transform it into a nonseparator line.
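As a rough illustration of the two alternatives just described, the sketch below accepts a separator-testing callable and defaults it to str.isspace, so that a line holding only stray blanks still counts as a separator. It is written as a Python 3 generator rather than in the Python 2.1 style of the recipe’s class, and the names paragraphs_by_predicate and is_separator are assumptions made for this example, not part of the recipe.

```python
def paragraphs_by_predicate(lines, is_separator=str.isspace):
    """Yield paragraphs from an iterable of lines.

    is_separator is a hypothetical hook: any callable that takes a line
    and returns a true value when that line separates paragraphs.
    """
    paragraph = []
    for line in lines:
        if is_separator(line):
            # A separator ends the paragraph in progress, if there is one
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    # The last paragraph may not be followed by a separator
    if paragraph:
        yield ''.join(paragraph)


# The third line contains only blanks and a tab, yet it still separates
sample = ["first\n", "para\n", "  \t\n", "\n", "second\n"]
result = list(paragraphs_by_predicate(sample))

# A custom predicate: treat lines of two dashes as separators instead
dashed = list(paragraphs_by_predicate(["a\n", "--\n", "b\n"],
                                      lambda line: line == "--\n"))
```

Whether this extra hook is worth having is exactly the design question the text raises; the default alone covers the line.isspace() choice.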
This recipe’s adapter class is a special case of sequence adaptation by bunching. An underlying sequence (here, a sequence of lines, provided by xreadlines on a file or file-like object) is bunched up into another sequence of larger units (here, a sequence of paragraph strings). The pattern is easy to generalize to other sequence-bunching needs. Of course, it’s even easier with iterators and generators in Python 2.2, but even Python 2.1 is pretty good at this already. Sequence adaptation is an important general issue that arises particularly often when you are sequentially reading and/or writing files; see Recipe 4.10 for another example.
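To suggest how the bunching pattern carries over to sequences other than lines, here is a minimal Python 3 sketch built on itertools.groupby, a standard-library tool that is not available in the Python versions this recipe targets. The function name bunch and its is_separator parameter are hypothetical, chosen only for this illustration.

```python
from itertools import groupby


def bunch(items, is_separator):
    """Yield lists of consecutive non-separator items from any iterable."""
    # groupby splits the input into runs of items that share the same key;
    # bool() normalizes the key so every truthy result lands in one run.
    for separator_run, group in groupby(
            items, key=lambda item: bool(is_separator(item))):
        if not separator_run:
            yield list(group)


# Bunch runs of nonzero numbers, treating zeros as separators
numbers = [1, 2, 0, 0, 3, 0, 4, 5]
bunches = list(bunch(numbers, lambda n: n == 0))
```

Joining each bunch of lines with ''.join would give paragraph strings again, which is the special case the recipe handles.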
For Python 2.1, we need an index of the underlying sequence of lines and a way to check that our __getitem__ method is being called with properly sequential indexes (as the for statement does), so we expose the line_num and para_num indexes as useful attributes of our object. Thus, client code can determine our position during a sequential scan, in regard to the indexing on the underlying line sequence, the paragraph sequence, or both, without needing to track it itself.
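The indexing protocol described here can still be observed in Python 3: when an object defines no __iter__ method, a for loop falls back to calling __getitem__ with 0, 1, 2, and so on until IndexError is raised. The sketch below shows that protocol in isolation; the Countdown class is a hypothetical stand-in, not part of the recipe.

```python
class Countdown:
    """Sequence-like object that supports only sequential access."""

    def __init__(self, start):
        self.start = start
        self.next_index = 0    # the only index we are willing to serve

    def __getitem__(self, index):
        if index != self.next_index:
            raise TypeError("Only sequential access supported")
        if index >= self.start:
            # IndexError is the signal that ends the for loop
            raise IndexError(index)
        self.next_index += 1
        return self.start - index


countdown = Countdown(3)
values = []
for value in countdown:    # calls countdown[0], countdown[1], ...
    values.append(value)
```

After the loop, next_index records how far the scan got, much as para_num does in the recipe’s class, and any out-of-order request is rejected with TypeError.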
The code uses two separate loops, each in a typical pattern:
while 1:
    ...
    if xxx: break
The first loop skips over zero or more separators that may occur between arbitrary paragraphs. Then, a separate loop accumulates nonseparators into a result list, until the underlying file finishes or a separator is encountered.
It’s an elementary issue, but quite important to performance, to build up the result as a list of strings and combine them with ''.join at the end. Building up a large string as a string, by repeated application of += in a loop, is never the right approach: it’s slow and clumsy. Good Pythonic style demands using a list as the intermediate accumulator when building up a string.
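The accumulate-then-join idiom looks like this in a small, self-contained Python 3 sketch. The function number_lines is a hypothetical example invented here to show the pattern, not something the recipe defines.

```python
def number_lines(lines):
    """Prefix each line with its 1-based number and return one string."""
    parts = []    # intermediate accumulator: a list, not a string
    for number, line in enumerate(lines, 1):
        parts.append("%d: %s" % (number, line))
    # Combine all the pieces in a single step at the end
    return ''.join(parts)


numbered = number_lines(["alpha\n", "beta\n"])
```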
The show_paragraphs function demonstrates all the simple features of the Paragraphs class and can be used to unit-test the latter by feeding it a known text file.
Python 2.2 makes it very easy to build iterators and generators. This, in turn, makes it very tempting to build a more lightweight version of the by-paragraph buncher as a generator function, with no classes involved:
from __future__ import generators

def paragraphs(fileobj, separator='\n'):
    if separator[-1:] != '\n': separator += '\n'
    paragraph = []
    for line in fileobj:
        if line == separator:
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph: yield ''.join(paragraph)
We don’t get the line and paragraph numbers, but the approach is much more lightweight, and it works polymorphically on any fileobj that can be iterated on to yield a sequence of lines, not just a file or file-like object. Such useful polymorphism is always a nice plus, particularly considering that it’s basically free. Here, we have merged the loops into one, and we use the intermediate list paragraph itself as the state indicator. If the list is empty, we’re skipping separators; otherwise, we’re accumulating nonseparators.
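If the numbers are wanted after all, one possible way to recover them in the generator style is sketched below for Python 3. The name numbered_paragraphs and the numbering conventions (paragraphs counted from 1, each reported with the 1-based number of its first line) are assumptions of this example and differ from the line_num and para_num attributes of the class. The usage lines also exercise the polymorphism just mentioned, feeding the same text once as a plain list of lines and once as a file-like object.

```python
import io


def numbered_paragraphs(lines, separator='\n'):
    """Yield (para_num, start_line, paragraph) tuples.

    para_num counts paragraphs from 1; start_line is the 1-based number
    of the line on which the paragraph begins.
    """
    if separator[-1:] != '\n':
        separator += '\n'
    paragraph = []
    para_num = 0
    start_line = 0
    for line_num, line in enumerate(lines, 1):
        if line == separator:
            if paragraph:
                para_num += 1
                yield para_num, start_line, ''.join(paragraph)
                paragraph = []
        else:
            if not paragraph:
                # First line of a new paragraph: remember where it starts
                start_line = line_num
            paragraph.append(line)
    if paragraph:
        yield para_num + 1, start_line, ''.join(paragraph)


text = "one\ntwo\n\n\nthree\n"
from_list = list(numbered_paragraphs(text.splitlines(True)))
from_file_like = list(numbered_paragraphs(io.StringIO(text)))
```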
See also: Recipe 4.10; documentation on the xreadlines module in the Library Reference; the Big Design Up Front Wiki page (http://xp.c2.com/BigDesignUpFront.html).