So far, our little programs have had some interesting qualities: the ability to work with language, and the potential to save human effort through automation. A key feature of programming is the ability of machines to make decisions on our behalf, executing instructions when certain conditions are met, or repeatedly looping through text data until some condition is satisfied. This feature is known as control, and is the focus of this section.
Python supports a wide range of operators, such as < and >=, for testing the relationship between values. The full set of these relational operators is shown in Table 1-3.
Table 1-3. Numerical comparison operators
Operator | Relationship |
---|---|
< | Less than |
<= | Less than or equal to |
== | Equal to (note this is two "=" signs, not one) |
!= | Not equal to |
> | Greater than |
>= | Greater than or equal to |
We can use these to select different words from a sentence of news text. Here are some examples; notice that only the operator changes from one line to the next. They all use sent7, the first sentence from text7 (Wall Street Journal). As before, if you get an error saying that sent7 is undefined, you need to first type: from nltk.book import *
>>> sent7
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', '29', '.']
>>>
There is a common pattern to all of these examples: [w for w in text if condition], where condition is a Python “test” that yields either true or false. In the cases shown in the previous code example, the condition is always a numerical comparison. However, we can also test various properties of words, using the functions listed in Table 1-4.
Table 1-4. Some word comparison operators
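Function | Meaning |
---|---|
s.startswith(t) | Test if s starts with t |
s.endswith(t) | Test if s ends with t |
t in s | Test if t is contained inside s |
s.islower() | Test if all cased characters in s are lowercase |
s.isupper() | Test if all cased characters in s are uppercase |
s.isalpha() | Test if all characters in s are alphabetic |
s.isalnum() | Test if all characters in s are alphanumeric |
s.isdigit() | Test if all characters in s are digits |
s.istitle() | Test if s is titlecased (all words in s have initial capitals) |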
Here are some examples of these operators being used to select words from our texts: words ending with -ableness; words containing gnt; words having an initial capital; and words consisting entirely of digits.
>>> sorted([w for w in set(text1) if w.endswith('ableness')])
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...]
>>> sorted([term for term in set(text4) if 'gnt' in term])
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted([item for item in set(text6) if item.istitle()])
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...]
>>> sorted([item for item in set(sent7) if item.isdigit()])
['29', '61']
>>>
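To get a feel for the word tests, it can help to apply them to individual strings before using them inside a list comprehension. The following is a minimal sketch written for Python 3 (where print is a function rather than a statement); the sample strings and the helper name describe are made up for illustration and are not part of NLTK.

```python
# Hypothetical sample strings, chosen only to exercise the word tests.
samples = ['sovereignty', 'Pierre', 'NLTK', '61', 'Nov.', 'abc123']

def describe(s):
    """Return the names of the no-argument string tests that s passes."""
    tests = [
        ('islower', s.islower()),
        ('isupper', s.isupper()),
        ('isalpha', s.isalpha()),
        ('isalnum', s.isalnum()),
        ('isdigit', s.isdigit()),
        ('istitle', s.istitle()),
    ]
    return [name for name, passed in tests if passed]

for s in samples:
    print(s, describe(s))

# The remaining tests take a second string as an argument.
word = 'sovereignty'
starts = word.startswith('sov')   # True
ends = word.endswith('ty')        # True
contains = 'gnt' in word          # True
```

Notice that '61' passes neither islower nor isupper: it has no cased characters at all, so both tests fail.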
We can also create more complex conditions. If c is a condition, then not c is also a condition. If we have two conditions c1 and c2, then we can combine them to form a new condition using conjunction and disjunction: c1 and c2, c1 or c2.
Note
Your Turn: Run the following examples and try to explain what is going on in each one. Next, try to make up some conditions of your own.
>>> sorted([w for w in set(text7) if '-' in w and 'index' in w])
>>> sorted([wd for wd in set(text3) if wd.istitle() and len(wd) > 10])
>>> sorted([w for w in set(sent7) if not w.islower()])
>>> sorted([t for t in set(text2) if 'cie' in t or 'cei' in t])
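As a small, self-contained illustration of not, and, and or, here is a sketch that types sent7 in by hand, so it runs without loading nltk.book. The particular conditions are our own choices, picked so that the results are short enough to check by eye.

```python
# sent7 copied from the earlier example so that no NLTK data is needed.
sent7 = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join',
         'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

# not c: keep the tokens that are NOT purely alphabetic.
not_alpha = sorted(w for w in set(sent7) if not w.isalpha())

# c1 and c2: both conditions must hold.
long_lower = sorted(w for w in set(sent7) if w.islower() and len(w) > 5)

# c1 or c2: at least one of the conditions must hold.
digits_or_title = sorted(w for w in set(sent7) if w.isdigit() or w.istitle())

print(not_alpha)        # [',', '.', '29', '61', 'Nov.']
print(long_lower)       # ['director', 'nonexecutive']
print(digits_or_title)  # ['29', '61', 'Nov.', 'Pierre', 'Vinken']
```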
In Computing with Language: Simple Statistics, we saw some examples of counting items other than words. Let’s take a closer look at the notation we used:
>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> [w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
>>>
These expressions have the form [f(w) for ...] or [w.f() for ...], where f is a function that operates on a word to compute its length, or to convert it to uppercase. For now, you don’t need to understand the difference between the notations f(w) and w.f(). Instead, simply learn this Python idiom, which performs the same operation on every element of a list. In the preceding examples, it goes through each word in text1, assigning each one in turn to the variable w and performing the specified operation on the variable.
Note
The notation just described is called a “list comprehension.” This is our first example of a Python idiom, a fixed notation that we use habitually without bothering to analyze each time. Mastering such idioms is an important part of becoming a fluent Python programmer.
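One way to see what a list comprehension does is to write the same computation out step by step. The sketch below assumes a short hand-typed word list in place of text1, and uses a for loop (a control structure covered later in this section) to build the same list one item at a time.

```python
# A short stand-in for text1, typed in by hand for illustration.
words = ['Call', 'me', 'Ishmael', '.']

# The [f(w) for ...] form: apply len to every word.
lengths = [len(w) for w in words]

# The same result built step by step with an explicit loop.
lengths_loop = []
for w in words:
    lengths_loop.append(len(w))

# The [w.f() for ...] form: call the upper method on every word.
shouted = [w.upper() for w in words]

print(lengths)                  # [4, 2, 7, 1]
print(lengths == lengths_loop)  # True
print(shouted)                  # ['CALL', 'ME', 'ISHMAEL', '.']
```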
Let’s return to the question of vocabulary size, and apply the same idiom here:
>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set([word.lower() for word in text1]))
17231
>>>
Now that we are not double-counting words like This and this, which differ only in capitalization, we’ve wiped 2,000 off the vocabulary count! We can go a step further and eliminate numbers and punctuation from the vocabulary count by filtering out any non-alphabetic items:
>>> len(set([word.lower() for word in text1 if word.isalpha()]))
16948
>>>
This example is slightly complicated: it lowercases all the purely alphabetic items. Perhaps it would have been simpler just to count the lowercase-only items, but this gives the wrong answer (why?).
Don’t worry if you don’t feel confident with list comprehensions yet, since you’ll see many more examples along with explanations in the following chapters.
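The successive vocabulary counts are easier to follow on a list small enough to check by hand. The following sketch uses a made-up list of nine tokens (not taken from any NLTK text) and applies the same steps: count the tokens, count the distinct items, fold case, and finally keep only alphabetic items.

```python
# A made-up token list, small enough to verify each count by hand.
tokens = ['This', 'is', 'this', ',', 'and', 'THIS', 'is', '1851', '.']

n_tokens = len(tokens)                         # every token, repeats included
n_types = len(set(tokens))                     # distinct items, case-sensitive
n_folded = len(set(w.lower() for w in tokens)) # distinct items after lowercasing
n_words = len(set(w.lower() for w in tokens if w.isalpha()))  # alphabetic only

print(n_tokens, n_types, n_folded, n_words)    # 9 8 6 3
```

Here 'This', 'this', and 'THIS' collapse into one item after lowercasing, and the final filter drops ',', '1851', and '.'.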
Most programming languages permit us to execute a block of code when a conditional expression, or if statement, is satisfied. We already saw examples of conditional tests in code like [w for w in sent7 if len(w) < 4]. In the following program, we have created a variable called word containing the string value 'cat'. The if statement checks whether the test len(word) < 5 is true. It is, so the body of the if statement is invoked and the print statement is executed, displaying a message to the user. Remember to indent the print statement by typing four spaces.
>>> word = 'cat'
>>> if len(word) < 5:
...     print 'word length is less than 5'
...
word length is less than 5
>>>
When we use the Python interpreter we have to add an extra blank line in order for it to detect that the nested block is complete.
If we change the conditional test to len(word) >= 5, to check that the length of word is greater than or equal to 5, then the test will no longer be true. This time, the body of the if statement will not be executed, and no message is shown to the user:
>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'
...
>>>
An if statement is known as a control structure because it controls whether the code in the indented block will be run. Another control structure is the for loop. Try the following, and remember to include the colon and the four spaces:
>>> for word in ['Call', 'me', 'Ishmael', '.']:
...     print word
...
Call
me
Ishmael
.
>>>
This is called a loop because Python executes the code in circular fashion. It starts by performing the assignment word = 'Call', effectively using the word variable to name the first item of the list. Then, it displays the value of word to the user. Next, it goes back to the for statement, and performs the assignment word = 'me' before displaying this new value to the user, and so on. It continues in this fashion until every item of the list has been processed.
Now we can combine the if and for statements. We will loop over every item of the list, and print the item only if it ends with the letter l. We’ll pick another name for the variable to demonstrate that Python doesn’t try to make sense of variable names.
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print xyzzy
...
Call
Ishmael
>>>
You will notice that if and for statements have a colon at the end of the line, before the indentation begins. In fact, all Python control structures end with a colon. The colon indicates that the current statement relates to the indented block that follows.
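A for loop with an if statement nested inside it does the same selecting work as the [w for w in text if condition] idiom from earlier in this section. Here is a minimal sketch comparing the two forms; it collects the matching items in a list instead of printing them, so that the results can be compared directly.

```python
sent1 = ['Call', 'me', 'Ishmael', '.']

# Loop form: the colon on each control line introduces an indented block.
ends_with_l = []
for xyzzy in sent1:
    if xyzzy.endswith('l'):
        ends_with_l.append(xyzzy)

# Comprehension form: the same loop and the same test, written on one line.
same_thing = [w for w in sent1 if w.endswith('l')]

print(ends_with_l)                # ['Call', 'Ishmael']
print(ends_with_l == same_thing)  # True
```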
We can also specify an action to be taken if the condition of the if statement is not met. Here we see the elif (else if) statement, and the else statement. Notice that these also have colons before the indented code.
>>> for token in sent1:
...     if token.islower():
...         print token, 'is a lowercase word'
...     elif token.istitle():
...         print token, 'is a titlecase word'
...     else:
...         print token, 'is punctuation'
...
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
>>>
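It is worth remembering that the else branch catches everything that the earlier tests reject, not only punctuation. The sketch below is written for Python 3, where print is a function. It applies the same three-way test to a list with two extra tokens, '1851' and 'NLTK', which we have added for illustration.

```python
# sent1 plus two extra tokens added here to probe the else branch.
tokens = ['Call', 'me', 'Ishmael', '.', '1851', 'NLTK']

labels = []
for token in tokens:
    if token.islower():
        label = 'a lowercase word'
    elif token.istitle():
        label = 'a titlecase word'
    else:
        # Reached by anything that is neither lowercase nor titlecase.
        label = 'punctuation'
    labels.append((token, label))
    print(token, 'is', label)
```

With these tests, '1851' and 'NLTK' are both reported as punctuation. If that matters for your data, you could add further elif branches (for example, one using isdigit) before the final else.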
As you can see, even with this small amount of Python knowledge, you can start to build multiline Python programs. It’s important to develop such programs in pieces, testing that each piece does what you expect before combining them into a program. This is why the Python interactive interpreter is so invaluable, and why you should get comfortable using it.
Finally, let’s combine the idioms we’ve been exploring. First, we create a list of cie and cei words, then we loop over each item and print it. Notice the comma at the end of the print statement, which tells Python to produce its output on a single line.
>>> tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])
>>> for word in tricky:
...     print word,
...
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>
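The trailing comma belongs to the print statement of Python 2. In Python 3, print is a function, and the way to stay on one line is its end parameter, or alternatively to join the words into a single string first. A minimal sketch, using a hand-typed list of the first five words from the output above in place of the full tricky list:

```python
# The first few items of the output above, typed in by hand.
tricky = ['ancient', 'ceiling', 'conceit', 'conceited', 'conceive']

# Option 1: ask print to end each item with a space instead of a newline.
for word in tricky:
    print(word, end=' ')
print()  # finish the line

# Option 2: build the whole line as one string, then print it once.
line = ' '.join(tricky)
print(line)
```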