Training a POS tagger
We will now train our own POS tagger, using NLTK's tagged corpora and scikit-learn's random forest machine learning (ML) model. The complete Jupyter Notebook for this section is available at Chapter02/02_example.ipynb in the book's code repository. This is a classification task: we need to predict the POS tag for a given word in a sentence. We will use the NLTK treebank dataset, which comes with POS tags, as the labeled training data. As features, we will extract each word's prefix and suffix, along with the previous and neighboring words in the text. These features are good indicators of a word's part of speech. The code that follows shows how we can extract these features: ...
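As a rough illustration of the feature extraction described here, the following is a minimal sketch and not the notebook's actual code. The function names, the feature keys, the three-character affix length, the `<START>`/`<END>` padding tokens, and the reading of "neighboring word" as the next word are all assumptions made for this example.

```python
def extract_features(words, index, affix_len=3):
    """Build a feature dict for the word at position `index`.

    `words` is the list of tokens in one sentence. The feature names and
    the affix length are illustrative choices, not taken from the book.
    """
    word = words[index]
    return {
        "word": word.lower(),
        # First and last `affix_len` characters stand in for prefix/suffix.
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
        # Padding tokens mark the sentence boundaries (an assumption).
        "prev_word": words[index - 1].lower() if index > 0 else "<START>",
        "next_word": (
            words[index + 1].lower() if index < len(words) - 1 else "<END>"
        ),
    }


def sentence_to_examples(tagged_sentence):
    """Split one tagged sentence into parallel lists of features and tags."""
    words = [word for word, _ in tagged_sentence]
    features = [extract_features(words, i) for i in range(len(words))]
    tags = [tag for _, tag in tagged_sentence]
    return features, tags


# A hand-written sentence in the (word, tag) format used by tagged corpora.
sample = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
features, tags = sentence_to_examples(sample)
print(features[1])
print(tags)
```

Sentences from NLTK's `treebank.tagged_sents()` have the same list-of-pairs shape as `sample`, so they could be passed to `sentence_to_examples` directly. The resulting dicts would still need to be turned into numeric vectors, for example with scikit-learn's `DictVectorizer`, before a `RandomForestClassifier` can be fitted on them.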