2

Text Representation

A computer operates on zeros and ones, and algorithms operate on numerical values. A computer cannot directly interpret texts such as the plays of William Shakespeare or the books of Leo Tolstoy, so raw text must be converted to numerical values before a computer can process it. This conversion is the first step in NLP.

In this chapter, we will learn about the basic text representations: Bag-of-Words, Bag-of-N-grams, and TF-IDF. The chapter is written for absolute NLP beginners and shows how to code with Gensim, scikit-learn, and NLTK. We will cover the following topics:

  • What text representation is
  • The transition from one-hot encoding to Bag-of-Words to Bag-of-N-grams
  • What TF-IDF is ...
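To make the topics above concrete, here is a minimal sketch of the four representations on a toy corpus. It uses only the Python standard library rather than Gensim, scikit-learn, or NLTK, so the mechanics stay visible. The corpus, the whitespace tokenizer, and the function names are assumptions made for illustration. The TF-IDF weighting shown (term frequency divided by document length, multiplied by log(N / df)) is one common variant, and libraries typically apply different smoothing and normalization.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]


def tokenize(text):
    # Naive whitespace tokenizer; real pipelines usually do more.
    return text.lower().split()


tokenized = [tokenize(d) for d in docs]

# Vocabulary: sorted unique tokens, each mapped to a vector position.
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}


def one_hot(token):
    # One-hot encoding: a vector with a single 1 at the token's position.
    vec = [0] * len(vocab)
    vec[index[token]] = 1
    return vec


def bag_of_words(tokens):
    # Bag-of-Words: per-document token counts; word order is discarded.
    counts = Counter(tokens)
    return [counts[tok] for tok in vocab]


def ngrams(tokens, n=2):
    # Bag-of-N-grams starts from runs of n adjacent tokens,
    # which keeps some local word order.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(tokens):
    # Assumed variant: tf = count / document length, idf = log(N / df).
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for tok in vocab:
        tf = counts[tok] / len(tokens)
        df = sum(1 for doc in tokenized if tok in doc)
        vec.append(tf * math.log(n_docs / df))
    return vec


print(vocab)                       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(one_hot("cat"))              # [1, 0, 0, 0, 0, 0, 0]
print(bag_of_words(tokenized[0]))  # [1, 0, 0, 1, 1, 1, 2]
print(ngrams(tokenized[0]))
print(tf_idf(tokenized[0]))
```

In this sketch, words that occur in every document (such as "the", "sat", and "on") receive a TF-IDF weight of zero because log(2 / 2) = 0, while words unique to one document (such as "cat") receive a positive weight.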

Get The Handbook of NLP with Gensim now with the O’Reilly learning platform.
