Chapter 2. Accessing Text Corpora and Lexical Resources
Practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora. The goal of this chapter is to answer the following questions:
What are some useful text corpora and lexical resources, and how can we access them with Python?
Which Python constructs are most helpful for this work?
How do we avoid repeating ourselves when writing Python code?
This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait until later before exploring each Python construct systematically. Don’t worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and—if you’re game—modify it by substituting some part of the code with a different text or word. This way you will associate a task with a programming idiom, and learn the hows and whys later.
Accessing Text Corpora
As just mentioned, a text corpus is a large body of text. Many
corpora are designed to contain a careful balance of material in one or
more genres. We examined some small text collections in Chapter 1, such as the speeches known as the US Presidential
Inaugural Addresses. This particular corpus actually contains dozens of
individual texts—one per address—but for convenience we glued them
end-to-end and treated them as a single text. Chapter 1 also used various predefined texts that we
accessed by typing from nltk.book import *. However, since we ...
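
The difference between a corpus of individual texts and the same material glued end-to-end can be sketched in plain Python. This is only an illustration under stated assumptions: the dictionary toy_corpus, its file names, its tokens, and the two helper functions are hypothetical stand-ins, not NLTK's own data or API.

```python
from itertools import chain

# Hypothetical stand-in for a corpus: file identifier -> list of tokens.
# The file names and tokens below are invented for illustration only.
toy_corpus = {
    "1789-First.txt": ["Fellow", "citizens", "of", "the", "Senate"],
    "1793-Second.txt": ["Fellow", "citizens", "I", "am", "again", "called"],
    "1797-Third.txt": ["When", "it", "was", "first", "perceived"],
}


def as_single_text(corpus):
    """Concatenate every document's tokens, in sorted file-id order."""
    return list(chain.from_iterable(corpus[fid] for fid in sorted(corpus)))


def per_document_lengths(corpus):
    """Return the token count of each document, keyed by file identifier."""
    return {fid: len(tokens) for fid, tokens in corpus.items()}


single_text = as_single_text(toy_corpus)
print(len(single_text))                  # size of the glued-together text
print(per_document_lengths(toy_corpus))  # sizes of the individual texts
```

Treating the collection as one list is convenient for whole-corpus counts, while keeping the per-file mapping preserves the boundaries between individual texts. With NLTK and its data installed, the real addresses can be explored in a similar way through nltk.corpus.inaugural, whose fileids() and words() methods list the files and return their tokens.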