Normalizing Text
In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g., set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. We discuss each of these in turn. First, we need to define the data we will use in this section:
>>> import nltk
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
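As a small illustration of the lowercase normalization described at the start of this section, the following sketch compares a vocabulary built with and without lower(). The token list here is a hypothetical sample chosen for the example, not data from the text.

```python
# Hypothetical sample tokens; any list of word strings would do.
text = ["The", "the", "THE", "Dog", "dog"]

# Without normalization, each case variant is a distinct vocabulary item.
raw_vocab = set(text)

# Lowercasing every token first collapses the case variants together.
normalized_vocab = set(w.lower() for w in text)

print(len(raw_vocab))             # 5
print(sorted(normalized_vocab))   # ['dog', 'the']
```

Lowercasing only removes case distinctions. It leaves affixes untouched, so dogs and dog would still count as separate items.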
Stemmers
NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer, you should use one of these in preference to crafting your own using regular expressions, since NLTK’s stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), whereas the Lancaster stemmer does not.
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', ...
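To see the two stemmers side by side on individual words, a minimal sketch along the following lines may help. It assumes NLTK is installed; the short word list is an illustrative selection from the passage, and the stems dictionary is a name introduced only for this example. Exact stems can vary between NLTK versions, so no output is shown.

```python
import nltk

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

# Illustrative sample of words taken from the passage above.
words = ["lying", "women", "distributing", "swords"]

# Record the stem that each stemmer produces for each word.
stems = {
    w: {"porter": porter.stem(w), "lancaster": lancaster.stem(w)}
    for w in words
}

# Print one row per word: original, Porter stem, Lancaster stem.
for w in words:
    print(f"{w:14} {stems[w]['porter']:14} {stems[w]['lancaster']}")
```

Running both stemmers over the same words makes it easier to judge which one suits a given application, since neither is correct in every case.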