Lexical Resources
A lexicon, or lexical resource, is a collection of words and/or
phrases along with associated information, such as part-of-speech and
sense definitions. Lexical resources are secondary to texts, and are
usually created and enriched with the help of texts. For example, if we
have defined a text my_text
, then
vocab = sorted(set(my_text))
builds
the vocabulary of my_text
, whereas
word_freq =
FreqDist(my_text)
counts the frequency of each word in the text. Both
vocab
and word_freq
are simple lexical resources.
Similarly, a concordance like the one we saw in Computing with Language: Texts and Words gives us
information about word usage that might help in the preparation of a
dictionary. Standard terminology for lexicons is illustrated in Figure 2-5. A lexical
entry consists of a headword (also known as a lemma) along with additional information, such
as the part-of-speech and the sense definition. Two distinct words
having the same spelling are called homonyms.
Figure 2-5. Lexicon terminology: Lexical entries for two lemmas having the same spelling (homonyms), providing part-of-speech and gloss information.
The simplest kind of lexicon is nothing more than a sorted list of words. Sophisticated lexicons include complex structure within and across the individual entries. In this section, we’ll look at some lexical resources included with NLTK.
Wordlist Corpora
NLTK includes ...
Get Natural Language Processing with Python now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.