By now you will have a sense of the capabilities of the Python programming language for processing natural language. However, if you’re new to Python or to programming, you may still be wrestling with Python and not feel like you are in full control yet. In this chapter we’ll address the following questions:
How can you write well-structured, readable programs that you and others will be able to reuse easily?
How do the fundamental building blocks work, such as loops, functions, and assignment?
What are some of the pitfalls with Python programming, and how can you avoid them?
Along the way, you will consolidate your knowledge of fundamental programming constructs, learn more about using features of the Python language in a natural and concise way, and learn some useful techniques in visualizing natural language data. As before, this chapter contains many examples and exercises (and as before, some exercises introduce new material). Readers new to programming should work through them carefully and consult other introductions to programming if necessary; experienced programmers can quickly skim this chapter.
In the other chapters of this book, we have organized the programming concepts as dictated by the needs of NLP. Here we revert to a more conventional approach, where the material is more closely tied to the structure of the programming language. There’s not room for a complete presentation of the language, so we’ll just focus on the language constructs and idioms that are most important for NLP.
Assignment would seem to be the most elementary programming concept, not deserving a separate discussion. However, there are some surprising subtleties here. Consider the following code fragment:
>>> foo = 'Monty'
>>> bar = foo
>>> foo = 'Python'
>>> bar
'Monty'
This behaves exactly as expected. When we write bar = foo in the second statement, the value of foo (the string 'Monty') is assigned to bar. That is, bar is a copy of foo, so when we overwrite foo with a new string 'Python' in the third statement, the value of bar is not affected.
However, assignment statements do not always involve making copies in this way. Assignment always copies the value of an expression, but a value is not always what you might expect it to be. In particular, the “value” of a structured object such as a list is actually just a reference to the object. In the following example, the statement bar = foo assigns the reference of foo to the new variable bar. Now when we modify something inside foo in the third statement, we can see that the contents of bar have also been changed.
>>> foo = ['Monty', 'Python']
>>> bar = foo
>>> foo[1] = 'Bodkin'
>>> bar
['Monty', 'Bodkin']
The line bar = foo does not copy the contents of the variable, only its “object reference.” To understand what is going on here, we need to know how lists are stored in the computer’s memory. In Figure 4-1, we see that a list foo is a reference to an object stored at location 3133 (which is itself a series of pointers to other locations holding strings). When we assign bar = foo, it is just the object reference 3133 that gets copied. This behavior extends to other aspects of the language, such as parameter passing (see Functions: The Foundation of Structured Programming).
Figure 4-1. List assignment and computer memory: Two list objects foo and bar reference the same location in the computer’s memory; updating foo will also modify bar, and vice versa.
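The remark about parameter passing can be illustrated with a minimal sketch. The helper name add_surname is hypothetical and chosen only for this example; the point is that the function receives a reference to the caller's list, not a copy of it.

```python
def add_surname(words):
    """Hypothetical helper: append to the list object that was passed in."""
    words.append('Bodkin')  # modifies the caller's list in place

foo = ['Monty']
bar = foo             # bar and foo refer to the same list object
add_surname(foo)      # the parameter `words` is a third reference to it

print(foo)            # ['Monty', 'Bodkin']
print(bar)            # ['Monty', 'Bodkin']
```

Because all three names refer to one list object, the change made inside the function is visible through both foo and bar.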
Let’s experiment some more, by creating a variable empty holding the empty list, then using it three times on the next line.
>>> empty = []
>>> nested = [empty, empty, empty]
>>> nested
[[], [], []]
>>> nested[1].append('Python')
>>> nested
[['Python'], ['Python'], ['Python']]
Observe that changing one of the items inside our nested list of lists changed them all. This is because each of the three elements is actually just a reference to one and the same list in memory.
Note
Your Turn: Use multiplication to create a list of lists: nested = [[]] * 3. Now modify one of the elements of the list, and observe that all the elements are changed. Use Python’s id() function to find out the numerical identifier for any object, and verify that id(nested[0]), id(nested[1]), and id(nested[2]) are all the same.
Now, notice that when we assign a new value to one of the elements of the list, it does not propagate to the others:
>>> nested = [[]] * 3
>>> nested[1].append('Python')
>>> nested[1] = ['Monty']
>>> nested
[['Python'], ['Monty'], ['Python']]
We began with a list containing three references to a single empty list object. Then we modified that object by appending 'Python' to it, resulting in a list containing three references to a single list object ['Python']. Next, we overwrote one of those references with a reference to a new object ['Monty']. This last step modified one of the three object references inside the nested list. However, the ['Python'] object wasn’t changed, and is still referenced from two places in our nested list of lists. It is crucial to appreciate this difference between modifying an object via an object reference and overwriting an object reference.
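The difference between modifying an object and overwriting a reference can be checked with id(). The following is a small sketch; the variable names shared, ids_after_modify, and ids_after_overwrite are introduced here for illustration only.

```python
shared = []
nested = [shared] * 3   # three references to one list object

# Modifying the object via a reference: every slot sees the change.
nested[1].append('Python')
ids_after_modify = [id(item) for item in nested]

# Overwriting a reference: only that slot now refers to a different object.
nested[1] = ['Monty']
ids_after_overwrite = [id(item) for item in nested]

print(len(set(ids_after_modify)))     # 1: one object, three references
print(len(set(ids_after_overwrite)))  # 2: slot 1 refers to a new object
print(nested)                         # [['Python'], ['Monty'], ['Python']]
```

The actual identifier values will differ from run to run; only the number of distinct identifiers matters here.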
Note
Important: To copy the items from a list foo to a new list bar, you can write bar = foo[:]. This copies the object references inside the list, so any nested objects are still shared between foo and bar. To copy a structure without copying any object references, use copy.deepcopy() from the copy module.
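As a rough illustration of the three options (plain assignment, a slice copy, and copy.deepcopy()), consider the following sketch. The names alias, shallow, and deep are chosen for this example only.

```python
import copy

foo = [['Monty'], ['Python']]

alias = foo                 # same outer list object
shallow = foo[:]            # new outer list, same inner lists
deep = copy.deepcopy(foo)   # new outer list and new inner lists

foo[0].append('Bodkin')     # modify an inner list in place
foo.append(['Brian'])       # modify the outer list in place

print(alias)    # [['Monty', 'Bodkin'], ['Python'], ['Brian']]
print(shallow)  # [['Monty', 'Bodkin'], ['Python']]
print(deep)     # [['Monty'], ['Python']]
```

The slice copy is insulated from changes to the outer list but not from changes to the inner lists it shares with foo; only the deep copy is fully independent.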
Python provides two ways to check that a pair of items are the same. The == operator tests whether two items have equal values, while the is operator tests for object identity. We can use is to verify our earlier observations about objects. First, we create a list containing several references to the same object, and demonstrate that they are not only identical according to ==, but also that they are one and the same object:
>>> size = 5
>>> python = ['Python']
>>> snake_nest = [python] * size
>>> snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]
True
>>> snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]
True
Now let’s put a new python in this nest. We can easily show that the objects are not all identical:
>>> import random
>>> position = random.choice(range(size))
>>> snake_nest[position] = ['Python']
>>> snake_nest
[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]
>>> snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]
True
>>> snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]
False
You can do several pairwise tests to discover which position contains the interloper, but the id() function makes detection easier:
>>> [id(snake) for snake in snake_nest]
[513528, 533168, 513528, 513528, 513528]
This reveals that the second item of the list has a distinct identifier. If you try running this code snippet yourself, expect to see different numbers in the resulting list, and don’t be surprised if the interloper is in a different position.
Having two kinds of equality might seem strange. However, it’s really just the type-token distinction, familiar from natural language, here showing up in a programming language.
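The type-token distinction can be reduced to a three-line experiment. This is only a sketch, and the names first, second, and third are introduced here for illustration: two separately built lists are the same type of thing (equal values) but different tokens (distinct objects).

```python
first = ['Python']
second = ['Python']   # built separately: an equal value, but a distinct object
third = first         # another name for the first object

print(first == second)  # True: same value
print(first is second)  # False: different objects
print(first is third)   # True: one and the same object
```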
In the condition part of an if statement, a non-empty string or list is evaluated as true, while an empty string or list evaluates as false.
>>> mixed = ['cat', '', ['dog'], []]
>>> for element in mixed:
...     if element:
...         print(element)
...
cat
['dog']
That is, we don’t need to say if len(element) > 0: in the condition.
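To see that the two forms of the condition agree for strings and lists, here is a minimal sketch. The function names non_empty and non_empty_explicit are hypothetical and exist only for this comparison.

```python
def non_empty(items):
    """Keep the items that evaluate as true (non-empty strings and lists)."""
    return [item for item in items if item]

def non_empty_explicit(items):
    """The same filter written with an explicit length test."""
    return [item for item in items if len(item) > 0]

mixed = ['cat', '', ['dog'], []]
print(non_empty(mixed))           # ['cat', ['dog']]
print(non_empty_explicit(mixed))  # ['cat', ['dog']]
```

One caveat: the bare test also treats values such as 0 and None as false, whereas len() would raise an error on them, so the two forms are interchangeable only when every item is a sequence.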
What’s the difference between using if...elif as opposed to using a couple of if statements in a row? Well, consider the following situation:
>>> animals = ['cat', 'dog']
>>> if 'cat' in animals:
...     print(1)
... elif 'dog' in animals:
...     print(2)
...
1
Since the if clause of the statement is satisfied, Python never tries to evaluate the elif clause, so we never get to print out 2. By contrast, if we replaced the elif by an if, then we would print out both 1 and 2.
So an elif clause potentially gives us more information than a bare if clause; when it evaluates to true, it tells us not only that the condition is satisfied, but also that the condition of the main if clause was not satisfied.
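The contrast between the two forms can be made concrete by wrapping each in a function and collecting the results instead of printing them. This is a sketch; the function names with_elif and with_two_ifs are hypothetical.

```python
def with_elif(animals):
    """Record 1 for 'cat'; record 2 for 'dog' only if 'cat' was absent."""
    hits = []
    if 'cat' in animals:
        hits.append(1)
    elif 'dog' in animals:
        hits.append(2)
    return hits

def with_two_ifs(animals):
    """Test both conditions independently."""
    hits = []
    if 'cat' in animals:
        hits.append(1)
    if 'dog' in animals:
        hits.append(2)
    return hits

animals = ['cat', 'dog']
print(with_elif(animals))     # [1]
print(with_two_ifs(animals))  # [1, 2]
print(with_elif(['dog']))     # [2]
```

In the last call, reaching the elif branch tells us both that 'dog' is present and that 'cat' is not.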
The functions all() and any() can be applied to a list (or other sequence) to check whether all or any items meet some condition:
>>> sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']
>>> all(len(w) > 4 for w in sent)
False
>>> any(len(w) > 4 for w in sent)
True
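To see why these two results come out as they do, it can help to list the words that satisfy the condition. The sketch below also shows how all() and any() behave on an empty sequence, a standard Python edge case that the example above does not cover; the names long_words, all_on_empty, and any_on_empty are introduced for illustration.

```python
sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']

# The words that meet the condition: some do, but not all.
long_words = [w for w in sent if len(w) > 4]
print(long_words)                     # ['anywhere', 'without', 'porpoise']
print(any(len(w) > 4 for w in sent))  # True
print(all(len(w) > 4 for w in sent))  # False

# With nothing to test, all() is true and any() is false.
all_on_empty = all(len(w) > 4 for w in [])
any_on_empty = any(len(w) > 4 for w in [])
print(all_on_empty)                   # True
print(any_on_empty)                   # False
```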