Text-based documents such as books, legal documents, social media posts, and e-mail contain a great deal of information. Extracting that information is critically important to modern AI systems, for example in search engines, legal AI, and automated news services.
Extracting useful features from text is a difficult problem. Text is not numerical in nature, so a model must be used to convert it into features that data mining algorithms can work with. The good news is that some simple models do a great job at this, including the bag-of-words model that we will use in this chapter.
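To make the idea concrete, here is a minimal sketch of a bag-of-words model using only the Python standard library. It is an illustration under stated assumptions, not necessarily the implementation used later in this chapter: the function names `bag_of_words` and `vectorize`, the simple regular-expression tokenizer, and the two sample sentences are all hypothetical choices made for this example.

```python
import re
from collections import Counter


def bag_of_words(document):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Assumed tokenizer: lowercase the text, then keep runs of
    # letters, digits, and apostrophes. Real tokenizers are more careful.
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)


def vectorize(documents):
    """Turn documents into count vectors over a shared, sorted vocabulary."""
    bags = [bag_of_words(doc) for doc in documents]
    # The vocabulary is every distinct word seen in any document.
    vocabulary = sorted(set().union(*bags))
    # A Counter returns 0 for missing words, so absent words get a zero count.
    vectors = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, vectors


# Made-up sample documents, for illustration only.
documents = ["The cat sat on the mat", "The dog sat"]
vocabulary, vectors = vectorize(documents)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each document becomes a row of word counts, which is the kind of numerical feature that data mining algorithms expect. Word order is discarded, which is the defining simplification of the model.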
In this chapter, we look at extracting features from text for use in data mining applications. ...