Chapter 4. Iterators and Generators
Iteration is one of Python's strongest
features. At a high level, you might simply view iteration as a way to
process items in a sequence. However, there is so much more that is
possible, such as creating your own iterator objects, applying useful iteration
patterns in the itertools
module, making generator functions, and so forth.
This chapter aims to address common problems involving iteration.
4.1. Manually Consuming an Iterator
Problem
You need to process items in an iterable, but for whatever reason, you can't or don't want to use a for loop.
Solution
To manually consume an iterable, use the next() function and write your code to catch the StopIteration exception. The following example manually reads lines from a file:
with open('/etc/passwd') as f:
    try:
        while True:
            line = next(f)
            print(line, end='')
    except StopIteration:
        pass
Normally, StopIteration is used to signal the end of iteration. However, if you're using next() manually (as shown), you can also instruct it to return a terminating value, such as None, instead. For example:
with open('/etc/passwd') as f:
    while True:
        line = next(f, None)
        if line is None:
            break
        print(line, end='')
Discussion
In most cases, the for
statement is used to consume an iterable.
However, every now and then, a problem calls for more precise control
over the underlying iteration mechanism. Thus, it is useful to know
what actually happens.
The following interactive example illustrates the basic mechanics of what happens during iteration:
>>> items = [1, 2, 3]
>>> # Get the iterator
>>> it = iter(items)        # Invokes items.__iter__()
>>> # Run the iterator
>>> next(it)                # Invokes it.__next__()
1
>>> next(it)
2
>>> next(it)
3
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
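Manual consumption also mixes naturally with a for loop, since an iterator remembers its position. For instance, here is a small sketch that pulls off a header line with next() and lets the for loop consume the rest:

with open('/etc/passwd') as f:
    first = next(f)          # manually grab the first line
    for line in f:           # iteration resumes where next() left off
        print(line, end='')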
Subsequent recipes in this chapter expand on iteration techniques, and knowledge of the basic iterator protocol is assumed. Be sure to tuck this first recipe away in your memory.
4.2. Delegating Iteration
Problem
You have built a custom container object that internally holds a list, tuple, or some other iterable. You would like to make iteration work with your new container.
Solution
Typically, all you need to do is define an __iter__()
method that
delegates iteration to the internally held container. For example:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)
# Example
if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    for ch in root:
        print(ch)
    # Outputs Node(1), Node(2)
In this code, the __iter__()
method simply forwards the iteration
request to the internally held _children
attribute.
Discussion
Python's iterator protocol requires __iter__()
to return a
special iterator object that implements a __next__()
method to
carry out the actual iteration. If all you are doing is iterating
over the contents of another container, you don't really
need to worry about the underlying details of how it works.
All you need to do is to forward the iteration request along.
The use of the iter() function here is a bit of a shortcut that cleans up the code. iter(s) simply returns the underlying iterator by calling s.__iter__(), much in the same way that len(s) invokes s.__len__().
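As a quick interactive check (reusing the Node class defined above), iteration over a Node produces exactly what iterating over its internal list would:

>>> root = Node(0)
>>> root.add_child(Node(1))
>>> root.add_child(Node(2))
>>> list(root)
[Node(1), Node(2)]
>>>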
4.3. Creating New Iteration Patterns with Generators
Problem
You want to implement a custom iteration pattern that's different than the usual built-in functions (e.g., range(), reversed(), etc.).
Solution
If you want to implement a new kind of iteration pattern, define it using a generator function. Here's a generator that produces a range of floating-point numbers:
def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment
To use such a function, you iterate over it using a for loop or use it with some other function that consumes an iterable (e.g., sum(), list(), etc.). For example:
>>> for n in frange(0, 4, 0.5):
...     print(n)
...
0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
>>> list(frange(0, 1, 0.125))
[0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875]
>>>
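Since any consumer of iterables works, sum() can drive the generator directly. The result below is exact because 0.125 is exactly representable in binary floating point:

>>> sum(frange(0, 1, 0.125))
3.5
>>>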
Discussion
The mere presence of the yield statement in a function turns it into a generator. Unlike a normal function, a generator only runs in response to iteration. Here's an experiment you can try to see the underlying mechanics of how such a function works:
>>> def countdown(n):
...     print('Starting to count from', n)
...     while n > 0:
...         yield n
...         n -= 1
...     print('Done!')
...
>>> # Create the generator, notice no output appears
>>> c = countdown(3)
>>> c
<generator object countdown at 0x1006a0af0>
>>> # Run to first yield and emit a value
>>> next(c)
Starting to count from 3
3
>>> # Run to the next yield
>>> next(c)
2
>>> # Run to next yield
>>> next(c)
1
>>> # Run to next yield (iteration stops)
>>> next(c)
Done!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
The key feature is that a generator function only runs in response to "next" operations carried out in iteration. Once a generator function returns, iteration stops. However, the for statement that's usually used to iterate takes care of these details, so you don't normally need to worry about them.
4.4. Implementing the Iterator Protocol
Problem
You are building custom objects on which you would like to support iteration, but would like an easy way to implement the iterator protocol.
Solution
By far, the easiest way to implement iteration on an object is to
use a generator function. In Recipe 4.2, a Node
class
was presented for representing tree structures. Perhaps you want to implement an iterator that traverses nodes in a depth-first pattern.
Here is how you could do it:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

    def depth_first(self):
        yield self
        for c in self:
            yield from c.depth_first()
# Example
if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    child1.add_child(Node(3))
    child1.add_child(Node(4))
    child2.add_child(Node(5))

    for ch in root.depth_first():
        print(ch)
    # Outputs Node(0), Node(1), Node(3), Node(4), Node(2), Node(5)
In this code, the depth_first() method is simple to read and describe. It first yields itself and then iterates over each child, yielding the items produced by the child's depth_first() method (using yield from).
Discussion
Python's iterator protocol requires __iter__()
to return a
special iterator object that implements a __next__()
operation and uses a
StopIteration
exception to signal completion. However, implementing
such objects can often be a messy affair. For example, the following
code shows an alternative implementation of the depth_first()
method
using an associated iterator class:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, other_node):
        self._children.append(other_node)

    def __iter__(self):
        return iter(self._children)

    def depth_first(self):
        return DepthFirstIterator(self)


class DepthFirstIterator(object):
    '''
    Depth-first traversal
    '''
    def __init__(self, start_node):
        self._node = start_node
        self._children_iter = None
        self._child_iter = None

    def __iter__(self):
        return self

    def __next__(self):
        # Return myself if just started; create an iterator for children
        if self._children_iter is None:
            self._children_iter = iter(self._node)
            return self._node
        # If processing a child, return its next item
        elif self._child_iter:
            try:
                nextchild = next(self._child_iter)
                return nextchild
            except StopIteration:
                self._child_iter = None
                return next(self)
        # Advance to the next child and start its iteration
        else:
            self._child_iter = next(self._children_iter).depth_first()
            return next(self)
The DepthFirstIterator class works in the same way as the generator version, but it's a mess because the iterator has to maintain a lot of complex state about where it is in the iteration process. Frankly, nobody likes to write mind-bending code like that. Define your iterator as a generator and be done with it.
4.5. Iterating in Reverse
Problem
You want to iterate in reverse over a sequence.
Solution
Use the built-in reversed()
function. For example:
>>> a = [1, 2, 3, 4]
>>> for x in reversed(a):
...     print(x)
...
4
3
2
1
Reversed iteration only works if the object in question has a size that can
be determined or if the object implements a __reversed__()
special method.
If neither of these can be satisfied, you'll have to convert the object
into a list first. For example:
# Print a file backwards
f = open('somefile')
for line in reversed(list(f)):
    print(line, end='')
Be aware that turning an iterable into a list as shown could consume a lot of memory if it's large.
Discussion
Many programmers don't realize that reversed iteration can be customized
on user-defined classes if they implement the __reversed__()
method.
For example:
class Countdown:
    def __init__(self, start):
        self.start = start

    # Forward iterator
    def __iter__(self):
        n = self.start
        while n > 0:
            yield n
            n -= 1

    # Reverse iterator
    def __reversed__(self):
        n = 1
        while n <= self.start:
            yield n
            n += 1
Defining a reversed iterator makes the code much more efficient, as it's no longer necessary to pull the data into a list and iterate in reverse on the list.
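For instance, here is a quick sketch exercising both directions of the Countdown class above:

>>> c = Countdown(3)
>>> list(c)              # uses __iter__()
[3, 2, 1]
>>> list(reversed(c))    # uses __reversed__()
[1, 2, 3]
>>>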
4.6. Defining Generator Functions with Extra State
Problem
You would like to define a generator function, but it involves extra state that you would like to expose to the user somehow.
Solution
If you want a generator to expose extra state to the user,
don't forget that you can easily implement it as a class, putting
the generator function code in the __iter__()
method. For
example:
from collections import deque

class linehistory:
    def __init__(self, lines, histlen=3):
        self.lines = lines
        self.history = deque(maxlen=histlen)

    def __iter__(self):
        for lineno, line in enumerate(self.lines, 1):
            self.history.append((lineno, line))
            yield line

    def clear(self):
        self.history.clear()
To use this class, you would treat it like a normal generator
function. However, since it creates an instance, you can
access internal attributes, such as the history
attribute or
the clear()
method. For example:
with open('somefile.txt') as f:
    lines = linehistory(f)
    for line in lines:
        if 'python' in line:
            for lineno, hline in lines.history:
                print('{}:{}'.format(lineno, hline), end='')
Discussion
With generators, it is easy to fall into a trap of trying to do
everything with functions alone. This can lead to rather complicated
code if the generator function needs to interact with other parts of
your program in unusual ways (exposing attributes, allowing control
via method calls, etc.). If this is the case, just use a class
definition, as shown. Defining your generator in the __iter__() method doesn't change anything about how you write your algorithm. The fact that it's part of a class makes it easy for you to provide attributes and methods for users to interact with.
One potential subtlety with the method shown is that it might require
an extra step of calling iter()
if you are going to drive iteration
using a technique other than a for
loop. For example:
>>> f = open('somefile.txt')
>>> lines = linehistory(f)
>>> next(lines)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'linehistory' object is not an iterator
>>> # Call iter() first, then start iterating
>>> it = iter(lines)
>>> next(it)
'hello world\n'
>>> next(it)
'this is a test\n'
>>>
4.7. Taking a Slice of an Iterator
Problem
You want to take a slice of data produced by an iterator, but the normal slicing operator doesn't work.
Solution
The itertools.islice()
function is perfectly suited
for taking slices of iterators and generators. For example:
>>> def count(n):
...     while True:
...         yield n
...         n += 1
...
>>> c = count(0)
>>> c[10:20]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'generator' object is not subscriptable
>>> # Now using islice()
>>> import itertools
>>> for x in itertools.islice(c, 10, 20):
...     print(x)
...
10
11
12
13
14
15
16
17
18
19
>>>
Discussion
Iterators and generators can't normally be sliced, because no information is known about their length (and they don't implement
indexing). The result of islice()
is an iterator that produces the
desired slice items, but it does this by consuming and discarding all
of the items up to the starting slice index. Further items are then
produced by the islice
object until the ending index has been
reached.
It's important to emphasize that islice() will consume data on the supplied iterator. Since iterators can't be rewound, that is something to consider. If it's important to go back, you should probably just turn the data into a list first.
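Continuing the interactive session above makes the point concrete: the count generator c was already advanced past item 20 by the first islice(), so a second slice picks up from there rather than rewinding:

>>> for x in itertools.islice(c, 0, 3):
...     print(x)
...
20
21
22
>>>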
4.8. Skipping the First Part of an Iterable
Problem
You want to iterate over items in an iterable, but the first few items aren't of interest and you just want to discard them.
Solution
The itertools module has a few functions that can be used to address this task. The first is the itertools.dropwhile() function. To use it, you supply a function and an iterable. The returned iterator discards the first items in the sequence as long as the supplied function returns True. Afterward, the entirety of the sequence is produced.
To illustrate, suppose you are reading a file that starts with a series of comment lines. For example:
>>> with open('/etc/passwd') as f:
...     for line in f:
...         print(line, end='')
...
##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode. At other times, this information is provided by
# Open Directory.
...
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
...
>>>
If you want to skip all of the initial comment lines, here's one way to do it:
>>> from itertools import dropwhile
>>> with open('/etc/passwd') as f:
...     for line in dropwhile(lambda line: line.startswith('#'), f):
...         print(line, end='')
...
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
...
>>>
This example is based on skipping the first items according to a test
function. If you happen to know the exact number of items you want to
skip, then you can use itertools.islice()
instead. For example:
>>> from itertools import islice
>>> items = ['a', 'b', 'c', 1, 4, 10, 15]
>>> for x in islice(items, 3, None):
...     print(x)
...
1
4
10
15
>>>
In this example, the last None argument to islice() is required to indicate that you want everything beyond the first three items, as opposed to only the first three items (e.g., a slice of [3:] as opposed to a slice of [:3]).
Discussion
The dropwhile()
and islice()
functions are mainly convenience functions
that you can use to avoid writing rather messy code such as this:
with open('/etc/passwd') as f:
    # Skip over initial comments
    while True:
        line = next(f, '')
        if not line.startswith('#'):
            break

    # Process remaining lines
    while line:
        # Replace with useful processing
        print(line, end='')
        line = next(f, None)
Discarding the first part of an iterable is also slightly different than simply filtering all of it. For example, the first part of this recipe might be rewritten as follows:
with open('/etc/passwd') as f:
    lines = (line for line in f if not line.startswith('#'))
    for line in lines:
        print(line, end='')
This will obviously discard the comment lines at the start, but will also discard all such lines throughout the entire file. On the other hand, the solution only discards items until an item no longer satisfies the supplied test. After that, all subsequent items are returned with no filtering.
Last, but not least, it should be emphasized that this recipe works with all iterables, including those whose size can't be determined in advance. This includes generators, files, and similar kinds of objects.
4.9. Iterating Over All Possible Combinations or Permutations
Problem
You want to iterate over all of the possible combinations or permutations of a collection of items.
Solution
The itertools
module provides three functions for this task.
The first of these, itertools.permutations(), takes a collection of items and produces a sequence of tuples that rearranges all of the items into all possible permutations (i.e., it shuffles them into all possible configurations). For example:
>>> items = ['a', 'b', 'c']
>>> from itertools import permutations
>>> for p in permutations(items):
...     print(p)
...
('a', 'b', 'c')
('a', 'c', 'b')
('b', 'a', 'c')
('b', 'c', 'a')
('c', 'a', 'b')
('c', 'b', 'a')
>>>
If you want all permutations of a smaller length, you can give an optional length argument. For example:
>>> for p in permutations(items, 2):
...     print(p)
...
('a', 'b')
('a', 'c')
('b', 'a')
('b', 'c')
('c', 'a')
('c', 'b')
>>>
Use itertools.combinations()
to produce a sequence of combinations
of items taken from the input. For example:
>>> from itertools import combinations
>>> for c in combinations(items, 3):
...     print(c)
...
('a', 'b', 'c')
>>> for c in combinations(items, 2):
...     print(c)
...
('a', 'b')
('a', 'c')
('b', 'c')
>>> for c in combinations(items, 1):
...     print(c)
...
('a',)
('b',)
('c',)
>>>
For combinations(), the actual order of the elements is not considered. That is, the combination ('a', 'b') is considered to be the same as ('b', 'a') (which is not produced).
When producing combinations, chosen items are removed from the
collection of possible candidates (i.e., if 'a'
has already been
chosen, then it is removed from consideration). The
itertools.combinations_with_replacement()
function relaxes this, and
allows the same item to be chosen more than once. For example:
>>> from itertools import combinations_with_replacement
>>> for c in combinations_with_replacement(items, 3):
...     print(c)
...
('a', 'a', 'a')
('a', 'a', 'b')
('a', 'a', 'c')
('a', 'b', 'b')
('a', 'b', 'c')
('a', 'c', 'c')
('b', 'b', 'b')
('b', 'b', 'c')
('b', 'c', 'c')
('c', 'c', 'c')
>>>
Discussion
This recipe demonstrates only some of the power found in the
itertools
module. Although you could certainly write code to
produce permutations and combinations yourself, doing so would
probably require more than a fair bit of thought. When faced with
seemingly complicated iteration problems, it always pays to look
at itertools
first. If the problem is common, chances are a solution
is already available.
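As a sanity check on the sizes involved, the number of results produced by these functions matches the standard counting formulas. If you are on Python 3.8 or newer, math.perm() and math.comb() make this easy to verify:

>>> import math
>>> sum(1 for _ in permutations(items, 2)) == math.perm(3, 2) == 6
True
>>> sum(1 for _ in combinations(items, 2)) == math.comb(3, 2) == 3
True
>>>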
4.10. Iterating Over the Index-Value Pairs of a Sequence
Problem
You want to iterate over a sequence, but would like to keep track of which element of the sequence is currently being processed.
Solution
The built-in enumerate()
function handles this quite nicely:
>>> my_list = ['a', 'b', 'c']
>>> for idx, val in enumerate(my_list):
...     print(idx, val)
...
0 a
1 b
2 c
For printing output with canonical line numbers (where you typically start
the numbering at 1 instead of 0), you can pass in a start
argument:
>>> my_list = ['a', 'b', 'c']
>>> for idx, val in enumerate(my_list, 1):
...     print(idx, val)
...
1 a
2 b
3 c
This case is especially useful for tracking line numbers in files should you want to use a line number in an error message:
def parse_data(filename):
    with open(filename, 'rt') as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            try:
                count = int(fields[1])
                ...
            except ValueError as e:
                print('Line {}: Parse error: {}'.format(lineno, e))
enumerate()
can be handy for keeping track of the offset into a list for
occurrences of certain values, for example. So, if you want to map words in a
file to the lines in which they occur, it can easily be accomplished
using enumerate()
to map each word to the line offset in the file where it
was found:
from collections import defaultdict

word_summary = defaultdict(list)

with open('myfile.txt', 'r') as f:
    lines = f.readlines()

for idx, line in enumerate(lines):
    # Create a list of words in current line
    words = [w.strip().lower() for w in line.split()]
    for word in words:
        word_summary[word].append(idx)
If you print word_summary after processing the file, it'll be a dictionary (a defaultdict to be precise), and it'll have a key for each word. The value for each word-key will be a list of line numbers that word occurred on. If the word occurred twice on a single line, that line number will be listed twice, making it possible to identify various simple metrics about the text.
Discussion
enumerate()
is a nice shortcut for situations where you might be
inclined to keep your own counter variable. You could write code like this:
lineno = 1
for line in f:
    # Process line
    ...
    lineno += 1
But it's usually much more elegant (and less error prone) to use enumerate()
instead:
for lineno, line in enumerate(f):
    # Process line
    ...
The value returned by enumerate() is an instance of an enumerate object, which is an iterator that returns successive tuples consisting of a counter and the value returned by calling next() on the sequence you've passed in.
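You can see this by driving an enumerate object by hand:

>>> e = enumerate(['a', 'b', 'c'])
>>> next(e)
(0, 'a')
>>> next(e)
(1, 'b')
>>>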
Although a minor point, it's worth mentioning that sometimes it is easy to get tripped up when applying enumerate() to a sequence of tuples that are also being unpacked. To do it, you have to write code like this:
data = [ (1, 2), (3, 4), (5, 6), (7, 8) ]

# Correct!
for n, (x, y) in enumerate(data):
    ...

# Error! Raises ValueError: not enough values to unpack
for n, x, y in enumerate(data):
    ...
4.11. Iterating Over Multiple Sequences Simultaneously
Problem
You want to iterate over the items contained in more than one sequence at a time.
Solution
To iterate over more than one sequence simultaneously, use the zip()
function. For example:
>>> xpts = [1, 5, 4, 2, 10, 7]
>>> ypts = [101, 78, 37, 15, 62, 99]
>>> for x, y in zip(xpts, ypts):
...     print(x, y)
...
1 101
5 78
4 37
2 15
10 62
7 99
>>>
zip(a, b) works by creating an iterator that produces tuples (x, y), where x is taken from a and y is taken from b. Iteration stops whenever one of the input sequences is exhausted. Thus, the length of the iteration is the same as the length of the shortest input. For example:
>>> a = [1, 2, 3]
>>> b = ['w', 'x', 'y', 'z']
>>> for i in zip(a, b):
...     print(i)
...
(1, 'w')
(2, 'x')
(3, 'y')
>>>
If this behavior is not desired, use itertools.zip_longest()
instead.
For example:
>>> from itertools import zip_longest
>>> for i in zip_longest(a, b):
...     print(i)
...
(1, 'w')
(2, 'x')
(3, 'y')
(None, 'z')
>>> for i in zip_longest(a, b, fillvalue=0):
...     print(i)
...
(1, 'w')
(2, 'x')
(3, 'y')
(0, 'z')
>>>
Discussion
zip()
is commonly used whenever you need to pair data together. For
example, suppose you have a list of column headers and column values
like this:
headers = ['name', 'shares', 'price']
values = ['ACME', 100, 490.1]
Using zip(), you can pair the values together to make a dictionary like this:
s = dict(zip(headers, values))
# s is now {'name': 'ACME', 'shares': 100, 'price': 490.1}
Alternatively, if you are trying to produce output, you can write code like this:
for name, val in zip(headers, values):
    print(name, '=', val)
It's less common, but zip()
can be passed more than two sequences as input.
For this case, the resulting tuples have the same number of items in them
as the number of input sequences. For example:
>>> a = [1, 2, 3]
>>> b = [10, 11, 12]
>>> c = ['x', 'y', 'z']
>>> for i in zip(a, b, c):
...     print(i)
...
(1, 10, 'x')
(2, 11, 'y')
(3, 12, 'z')
>>>
Last, but not least, it's important to emphasize that zip() creates
creates
an iterator as a result. If you need the paired values stored in a
list, use the list()
function. For example:
>>> zip(a, b)
<zip object at 0x1007001b8>
>>> list(zip(a, b))
[(1, 10), (2, 11), (3, 12)]
>>>
4.12. Iterating on Items in Separate Containers
Problem
You need to perform the same operation on many objects, but the objects are contained in different containers, and you'd like to avoid nested loops without losing the readability of your code.
Solution
The itertools.chain() function can be used to simplify this task. It takes a list of iterables as input, and returns an iterator that effectively masks the fact that you're really acting on multiple containers. To illustrate, consider this example:
>>> from itertools import chain
>>> a = [1, 2, 3, 4]
>>> b = ['x', 'y', 'z']
>>> for x in chain(a, b):
...     print(x)
...
1
2
3
4
x
y
z
>>>
A common use of chain()
is in programs where you would like to perform
certain operations on all of the items at once but the items are pooled into different working sets. For example:
# Various working sets of items
active_items = set()
inactive_items = set()

# Iterate over all items
for item in chain(active_items, inactive_items):
    # Process item
    ...
This solution is much more elegant than using two separate loops, as in the following:
for item in active_items:
    # Process item
    ...

for item in inactive_items:
    # Process item
    ...
Discussion
itertools.chain() accepts one or more iterables as arguments. It then works by creating an iterator that successively consumes and returns the items produced by each of the supplied iterables. It's a subtle distinction, but chain() is more efficient than first combining the sequences and iterating. For example:
# Inefficient
for x in a + b:
    ...

# Better
for x in chain(a, b):
    ...
In the first case, the operation a + b creates an entirely new sequence and additionally requires a and b to be of the same type. chain() performs no such operation, so it's far more efficient with memory if the input sequences are large, and it can be easily applied when the iterables in question are of different types.
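Relatedly, if your iterables are themselves produced by another iterable (e.g., a list of lists), the chain.from_iterable() classmethod avoids having to unpack them as separate arguments:

>>> from itertools import chain
>>> groups = [[1, 2], [3, 4], [5]]
>>> list(chain.from_iterable(groups))
[1, 2, 3, 4, 5]
>>>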
4.13. Creating Data Processing Pipelines
Problem
You want to process data iteratively in the style of a data processing pipeline (similar to Unix pipes). For instance, you have a huge amount of data that needs to be processed, but it can't fit entirely into memory.
Solution
Generator functions are a good way to implement processing pipelines. To illustrate, suppose you have a huge directory of log files that you want to process:
foo/
    access-log-012007.gz
    access-log-022007.gz
    access-log-032007.gz
    ...
    access-log-012008
bar/
    access-log-092007.bz2
    ...
    access-log-022008
Suppose each file contains lines of data like this:
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /ply/ ..." 200 11875
210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /favicon.ico ..." 404 369
61.135.216.105 - - [10/Jul/2012:00:20:04 -0500] "GET /blog/atom.xml ..." 304 -
...
To process these files, you could define a collection of small generator functions that perform specific self-contained tasks. For example:
import os
import fnmatch
import gzip
import bz2
import re

def gen_find(filepat, top):
    '''
    Find all filenames in a directory tree that match a shell wildcard pattern
    '''
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

def gen_opener(filenames):
    '''
    Open a sequence of filenames one at a time producing a file object.
    The file is closed immediately when proceeding to the next iteration.
    '''
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

def gen_concatenate(iterators):
    '''
    Chain a sequence of iterators together into a single sequence.
    '''
    for it in iterators:
        yield from it

def gen_grep(pattern, lines):
    '''
    Look for a regex pattern in a sequence of lines
    '''
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line
You can now easily stack these functions together to make a processing pipeline. For example, to find all log lines that contain the word python, you would just do this:
lognames = gen_find('access-log*', 'www')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('(?i)python', lines)
for line in pylines:
    print(line)
If you want to extend the pipeline further, you can even feed the data into generator expressions. For example, this version finds the number of bytes transferred and sums the total:
lognames = gen_find('access-log*', 'www')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('(?i)python', lines)
bytecolumn = (line.rsplit(None, 1)[1] for line in pylines)
bytes = (int(x) for x in bytecolumn if x != '-')
print('Total', sum(bytes))
Discussion
Processing data in a pipelined manner works well for a wide variety of other problems, including parsing, reading from real-time data sources, periodic polling, and so on.
In understanding the code, it is important to grasp that
the yield
statement acts as a kind of data producer whereas a
for
loop acts as a data consumer. When the generators are
stacked together, each yield
feeds a single item of data
to the next stage of the pipeline that is consuming it with iteration.
In the last example, the sum()
function is actually driving the
entire program, pulling one item at a time out of the pipeline
of generators.
One nice feature of this approach is that each generator function tends to be small and self-contained. As such, they are easy to write and maintain. In many cases, they are so general purpose that they can be reused in other contexts. The resulting code that glues the components together also tends to read like a simple recipe that is easily understood.
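For instance, here is a minimal sketch that reuses the same stages to count 404 responses instead; the regex is illustrative, written against the sample log format shown earlier:

lognames = gen_find('access-log*', 'www')
lines = gen_concatenate(gen_opener(lognames))
errors = gen_grep(r'" 404 ', lines)       # lines whose status code is 404
print('404 responses:', sum(1 for _ in errors))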
The memory efficiency of this approach can also not be overstated. The code shown would still work even if used on a massive directory of files. In fact, due to the iterative nature of the processing, very little memory would be used at all.
There is a bit of extreme subtlety involving the gen_concatenate() function. The purpose of this function is to concatenate input sequences together into one long sequence of lines. The itertools.chain() function performs a similar function, but requires that all of the chained iterables be specified as arguments. In the case of this particular recipe, doing that would involve a statement such as lines = itertools.chain(*files), which would cause the gen_opener() generator to be fully consumed. Since that generator is producing a sequence of open files that are immediately closed in the next iteration step, chain() can't be used. The solution shown avoids this issue.
Also appearing in the gen_concatenate() function is the use of yield from to delegate to a subgenerator. The statement yield from it simply makes gen_concatenate() emit all of the values produced by the generator it. This is described further in Recipe 4.14.
Last, but not least, it should be noted that a pipelined approach doesn't always work for every data handling problem. Sometimes you just need to work with all of the data at once. However, even in that case, using generator pipelines can be a way to logically break a problem down into a kind of workflow.
David Beazley has written extensively about these techniques in his "Generator Tricks for Systems Programmers" tutorial presentation. Consult that for even more examples.
4.14. Flattening a Nested Sequence
Problem
You have a nested sequence that you want to flatten into a single list of values.
Solution
This is easily solved by writing a recursive generator function
involving a yield from
statement. For example:
from collections.abc import Iterable   # collections.Iterable was removed in Python 3.10

def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            yield from flatten(x, ignore_types)
        else:
            yield x

items = [1, 2, [3, 4, [5, 6], 7], 8]

# Produces 1 2 3 4 5 6 7 8
for x in flatten(items):
    print(x)
In the code, the isinstance(x, Iterable)
simply checks to see
if an item is iterable. If so, yield from
is used to emit
all of its values as a kind of subroutine. The end result is a
single sequence of output with no nesting.
The extra argument ignore_types and the check for not isinstance(x, ignore_types) are there to prevent strings and bytes from being interpreted as iterables and expanded as individual characters. This allows nested lists of strings to work in the way that most people would expect. For example:
>>> items = ['Dave', 'Paula', ['Thomas', 'Lewis']]
>>> for x in flatten(items):
...     print(x)
...
Dave
Paula
Thomas
Lewis
>>>
Discussion
The yield from
statement is a nice shortcut to use if you ever
want to write generators that call other generators as subroutines.
If you don't use it, you need to write code that uses an extra
for
loop. For example:
def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            for i in flatten(x):
                yield i
        else:
            yield x
Although it's only a minor change, the yield from statement just feels better and leads to cleaner code.
As noted, the extra check for strings and bytes is there to prevent the expansion of those types into individual characters. If there are other types that you don't want expanded, you can supply a different value for the ignore_types argument.
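For example, here is a sketch that also treats tuples as atomic values (the tuple entries are purely illustrative):

>>> items = [1, (2, 3), [4, (5, 6)]]
>>> list(flatten(items, ignore_types=(str, bytes, tuple)))
[1, (2, 3), 4, (5, 6)]
>>>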
Finally, it should be noted that yield from
has a more important
role in advanced programs involving coroutines and generator-based
concurrency. See Recipe 12.12 for another example.
4.15. Iterating in Sorted Order Over Merged Sorted Iterables
Problem
You have a collection of sorted sequences and you want to iterate over a sorted sequence of them all merged together.
Solution
The heapq.merge()
function does exactly what you want.
For example:
>>> import heapq
>>> a = [1, 4, 7, 10]
>>> b = [2, 5, 6, 11]
>>> for c in heapq.merge(a, b):
...     print(c)
...
1
2
4
5
6
7
10
11
Discussion
The iterative nature of heapq.merge
means that it never reads
any of the supplied sequences all at once. This means that you
can use it on very long sequences with very little overhead.
For instance, here is an example of how you would merge two sorted files:
import heapq

with open('sorted_file_1', 'rt') as file1, \
     open('sorted_file_2', 'rt') as file2, \
     open('merged_file', 'wt') as outf:

    for line in heapq.merge(file1, file2):
        outf.write(line)
It's important to emphasize that heapq.merge()
requires that all of
the input sequences already be sorted. In particular, it does not
first read all of the data into a heap or do any preliminary sorting.
Nor does it perform any kind of validation of the inputs to check if
they meet the ordering requirements. Instead, it simply examines the
set of items from the front of each input sequence and emits
the smallest one found. A new item from the chosen sequence is then
read, and the process repeats itself until all input sequences have
been fully consumed.
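Because no validation is performed, unsorted inputs simply produce unsorted output. Also, if you need a different ordering, heapq.merge() accepts key and reverse arguments (added in Python 3.5). A quick sketch of both points:

>>> list(heapq.merge([1, 10, 2], [5]))    # unsorted input: garbage in, garbage out
[1, 5, 10, 2]
>>> list(heapq.merge([10, 4, 1], [11, 5], reverse=True))
[11, 10, 5, 4, 1]
>>>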
4.16. Replacing Infinite while Loops with an Iterator
Problem
You have code that uses a while
loop to iteratively process
data because it involves a function or some kind of unusual
test condition that doesn't fall into the usual iteration pattern.
Solution
A somewhat common scenario in programs involving I/O is to write code like this:
CHUNKSIZE = 8192

def reader(s):
    while True:
        data = s.recv(CHUNKSIZE)
        if data == b'':
            break
        process_data(data)
Such code can often be replaced using iter(), as follows:
def reader(s):
    for chunk in iter(lambda: s.recv(CHUNKSIZE), b''):
        process_data(chunk)
If you're a bit skeptical that it might work, you can try a similar example involving files. For example:
>>> import sys
>>> f = open('/etc/passwd')
>>> for chunk in iter(lambda: f.read(10), ''):
...     n = sys.stdout.write(chunk)
...
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
...
>>>
Discussion
A little-known feature of the built-in iter() function is that it optionally accepts a zero-argument callable and a sentinel (terminating) value as inputs. When used in this way, it creates an iterator that repeatedly calls the supplied callable until it returns the value given as the sentinel.
This particular approach works well with certain kinds of repeatedly called functions, such as those involving I/O. For example, if you want to read data in chunks from sockets or files, you usually have to repeatedly execute read() or recv() calls followed by an end-of-file test. This recipe simply takes these two features and combines them together into a single iter() call. The use of lambda in the solution is needed to create a callable that takes no arguments, yet still supplies the desired size argument to recv() or read().
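The same trick works anywhere a function is called repeatedly until a sentinel value appears. For instance, a small sketch that collects user input until a blank line is entered:

# Read lines typed by the user, stopping at the first empty line
for line in iter(input, ''):
    print('Got:', line)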