CHAPTER 10. Machine Learning with Text Documents
The word document sounds rather formal when you stop to consider how much text is out there and the forms it takes: a word-processed document, a blog post, an email, a news article, or an academic paper. When you pause to consider the amount of text data held on the Internet and the Web, well, it's a lot, and making sense of it is going to take some doing.
Text analysis, and machine learning on text, is not the easiest thing in the world to do. Documents are messy, there's a fair amount of cleaning to do, and they come in all sorts of formats, which presents challenges of its own.
In this chapter I will describe working methods for extracting information from text documents, and I'll also cover the steps needed to get the data ready for analysis. From there I'll show you three methods of learning from your text: TF/IDF, Word2Vec, and using neural networks to generate new text.
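To give you a flavour of what's coming, here is a minimal sketch of the first of those methods, TF/IDF, using scikit-learn's TfidfVectorizer. The three sample documents are made up purely for illustration; the chapter's own examples and the preparation steps follow later.

    # A quick taste of TF/IDF: score how important each word is to each document.
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical sample documents, just for illustration.
    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets",
    ]

    vectorizer = TfidfVectorizer()              # builds the vocabulary and weights
    tfidf_matrix = vectorizer.fit_transform(docs)

    # Each row is a document, each column a term; higher scores mean a term
    # appears often in that document but rarely across the collection.
    print(vectorizer.get_feature_names_out())
    print(tfidf_matrix.toarray().round(2))

The key idea is that a word frequent in one document but rare across the whole collection is probably telling you something about that document, and that is exactly what the scores in the output matrix capture.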
For further study of text analysis, it's worth looking into more advanced techniques such as Long Short-Term Memory (LSTM) networks, which give improved results where context awareness matters. Google has also designed a neural network architecture called Bidirectional Encoder Representations from Transformers (BERT); a basic overview of it is available online.
Preparing Text for Analysis
Let's start at the ...