3 Math with words (TF-IDF vectors)

This chapter covers

  • Counting words and term frequencies to analyze meaning
  • Predicting word occurrence probabilities with Zipf’s Law
  • Vector representation of words and how to start using them
  • Finding relevant documents from a corpus using inverse document frequencies
  • Estimating the similarity of pairs of documents with cosine similarity and Okapi BM25

Having collected and counted words (tokens), and bucketed them into stems or lemmas, it’s time to do something interesting with them. Detecting words is useful for simple tasks, like getting statistics about word usage or doing keyword search. But you’d like to know which words are more important to a particular document and across the corpus as a whole. Then ...
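To make the idea of word importance concrete, here is a minimal sketch of TF-IDF weighting and cosine similarity in plain Python. It is not the book's code. The toy corpus, the whitespace tokenizer, and every function name (`tokenize`, `tf`, `idf`, `tfidf_vector`, `cosine_similarity`) are assumptions made for illustration, and the unsmoothed `log(N / df)` form of inverse document frequency is only one of several common variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]


def tokenize(text):
    # Naive whitespace tokenizer; a fuller pipeline would also stem or lemmatize.
    return text.lower().split()


def tf(tokens):
    # Term frequency: count of each term divided by the document length.
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


def idf(tokenized_docs):
    # Inverse document frequency: log(N / df), with no smoothing (an assumption).
    n_docs = len(tokenized_docs)
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {term: math.log(n_docs / d) for term, d in df.items()}


def tfidf_vector(tokens, idf_weights, vocab):
    # One dimension per vocabulary term; terms absent from the document get 0.
    tf_weights = tf(tokens)
    return [tf_weights.get(term, 0.0) * idf_weights[term] for term in vocab]


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


tokenized = [tokenize(d) for d in docs]
idf_weights = idf(tokenized)
vocab = sorted(idf_weights)
vectors = [tfidf_vector(t, idf_weights, vocab) for t in tokenized]

print(cosine_similarity(vectors[0], vectors[1]))  # the two documents share some terms
print(cosine_similarity(vectors[0], vectors[2]))  # no shared tokens
```

In this sketch the first two documents overlap on "the", "sat", and "on", so their similarity is positive. The third shares no exact tokens with the first ("cats" and "cat" differ without stemming), so its similarity is zero, which is one reason stems or lemmas matter.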
