3 Text Wrangling and Preprocessing

Data preprocessing is often said to account for almost 80% of the work in NLP. Before we can do topic modeling with TF-IDF, LDA, LSA, or similar models, we need to prepare the texts. Without text preprocessing, the quality of the model outcome will suffer, and latent information may stay buried in the ocean of text. This is what the well-known phrase "garbage in, garbage out" (GIGO) refers to. In this chapter, we will learn the key steps in NLP preprocessing: tokenization, lowercase conversion, stop word removal, punctuation removal, stemming, and lemmatization. The first two are very basic, so we will spend more time on the rest. We will learn how to code these steps in spaCy, NLTK, and Gensim. Later, we will build a pipeline for NLP preprocessing applicable ...
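To make the six steps concrete before turning to the libraries, here is a minimal sketch in plain Python. It is not the spaCy, NLTK, or Gensim code the chapter covers: the stop word list, the lemma lookup table, and the helper names (`tokenize`, `naive_stem`, `preprocess`) are all assumptions made up for illustration, and the suffix stripper is only a crude stand-in for a real stemmer.

```python
import re
import string

# Tiny illustrative stop word list (an assumption; real libraries ship far larger lists).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "were"}

# Toy lemma lookup (an assumption); real lemmatizers rely on vocabularies and part-of-speech tags.
LEMMAS = {"mice": "mouse", "running": "run", "better": "good"}

PUNCTUATION = set(string.punctuation)


def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, use_lemmas=False):
    """Run the steps in order and return the cleaned tokens."""
    tokens = tokenize(text)                                # 1. tokenization
    tokens = [t.lower() for t in tokens]                   # 2. lowercase conversion
    tokens = [t for t in tokens if t not in PUNCTUATION]   # 3. punctuation removal
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 4. stop word removal
    if use_lemmas:
        return [LEMMAS.get(t, t) for t in tokens]          # 5b. lemmatization (lookup)
    return [naive_stem(t) for t in tokens]                 # 5a. stemming (suffix stripping)


print(preprocess("The mice were running in the fields!"))
# ['mice', 'runn', 'field']
print(preprocess("The mice were running in the fields!", use_lemmas=True))
# ['mouse', 'run', 'fields']
```

The two outputs hint at the difference between the last two steps: stemming chops suffixes mechanically and can produce non-words such as "runn", while lemmatization maps a word to its dictionary form but only for words it knows about.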

The Handbook of NLP with Gensim by Chris Kuo
