Conditional Frequency Distributions
We introduced frequency distributions in Computing with Language: Simple
Statistics. We saw that given some list mylist of words or other items,
FreqDist(mylist) would compute the number of occurrences of each item in
the list. Here we will generalize this idea.
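As a quick reminder of what such a count looks like, here is a minimal sketch. It uses collections.Counter from the Python standard library as a stand-in so that it runs without NLTK installed; in NLTK 3, FreqDist is a subclass of Counter, so FreqDist(mylist) can be indexed in the same way. The contents of mylist are invented for illustration.

```python
from collections import Counter

# Hypothetical word list, invented for this example.
mylist = ["the", "cat", "sat", "on", "the", "mat"]

# Counter stands in for nltk.FreqDist here (an assumption made to keep
# the sketch dependency-free); both map each item to its count.
fdist = Counter(mylist)

print(fdist["the"])  # 2: "the" occurs twice in mylist
print(fdist["dog"])  # 0: unseen items have a count of zero
```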
When the texts of a corpus are divided into several categories (by
genre, topic, author, etc.), we can maintain a separate frequency
distribution for each category. This will allow us to study systematic
differences between the categories. In the previous section, we achieved
this using NLTK’s ConditionalFreqDist data type. A conditional
frequency distribution is a collection of frequency
distributions, each one for a different “condition.” The condition will
often be the category of the text. Figure 2-4 depicts
a fragment of a conditional frequency distribution having just two
conditions, one for news text and one for romance text.
Figure 2-4. Counting words appearing in a text collection (a conditional frequency distribution).
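The idea of one frequency distribution per condition can be sketched with the standard library alone. The following is an illustrative model of the data structure, not NLTK's implementation: it assumes the input is a list of (condition, word) pairs, and the pairs themselves are invented rather than taken from any corpus.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, word) pairs; the words are made up for
# illustration and do not come from a real corpus.
pairs = [
    ("news", "the"), ("news", "could"), ("news", "the"),
    ("romance", "could"), ("romance", "love"), ("romance", "could"),
]

# One Counter (frequency distribution) per condition.
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1

print(sorted(cfd))              # ['news', 'romance']
print(cfd["news"]["the"])       # 2
print(cfd["romance"]["could"])  # 2
```

With NLTK installed, passing the same list of pairs to nltk.ConditionalFreqDist(pairs) should give an object that supports the same cfd[condition][word] lookups.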