Book description
Turning text into valuable information is essential for businesses looking to gain a competitive advantage. With recent improvements in natural language processing (NLP), users now have many options for solving complex challenges. But it's not always clear which NLP tools or libraries fit a business's needs, or which techniques to use and in what order.
This practical book provides data scientists and developers with blueprints for best practice solutions to common tasks in text analytics and natural language processing. Authors Jens Albrecht, Sidharth Ramachandran, and Christian Winkler provide real-world case studies and detailed code examples in Python to help you get started quickly.
- Extract data from APIs and web pages
- Prepare textual data for statistical analysis and machine learning
- Use machine learning for classification, topic modeling, and summarization
- Explain AI models and classification results
- Explore and visualize semantic similarities with word embeddings
- Identify customer sentiment in product reviews
- Create a knowledge graph based on named entities and their relations
Table of contents
- Preface
- 1. Gaining Early Insights from Textual Data
- What You'll Learn and What We'll Build
- Exploratory Data Analysis
- Introducing the Dataset
- Blueprint: Getting an Overview of the Data with Pandas
- Blueprint: Building a Simple Text Preprocessing Pipeline
- Blueprints for Word Frequency Analysis
- Blueprint: Finding a Keyword-in-Context
- Blueprint: Analyzing N-Grams
- Blueprint: Comparing Frequencies Across Time Intervals and Categories
- Closing Remarks
- 2. Extracting Textual Insights with APIs
- 3. Scraping Websites and Extracting Data
- What You'll Learn and What We'll Build
- Scraping and Data Extraction
- Introducing the Reuters News Archive
- URL Generation
- Blueprint: Downloading and Interpreting robots.txt
- Blueprint: Finding URLs from sitemap.xml
- Blueprint: Finding URLs from RSS
- Downloading Data
- Blueprint: Downloading HTML Pages with Python
- Blueprint: Downloading HTML Pages with wget
- Extracting Semistructured Data
- Blueprint: Extracting Data with Regular Expressions
- Blueprint: Using an HTML Parser for Extraction
- Blueprint: Spidering
- Density-Based Text Extraction
- All-in-One Approach
- Blueprint: Scraping the Reuters Archive with Scrapy
- Possible Problems with Scraping
- Closing Remarks and Recommendation
- 4. Preparing Textual Data for Statistics and Machine Learning
- 5. Feature Engineering and Syntactic Similarity
- What You'll Learn and What We'll Build
- A Toy Dataset for Experimentation
- Blueprint: Building Your Own Vectorizer
- Bag-of-Words Models
- TF-IDF Models
- Optimized Document Vectors with TfidfTransformer
- Introducing the ABC Dataset
- Blueprint: Reducing Feature Dimensions
- Blueprint: Improving Features by Making Them More Specific
- Blueprint: Using Lemmas Instead of Words for Vectorizing Documents
- Blueprint: Limit Word Types
- Blueprint: Remove Most Common Words
- Blueprint: Adding Context via N-Grams
- Syntactic Similarity in the ABC Dataset
- Summary and Conclusion
- 6. Text Classification Algorithms
- What You'll Learn and What We'll Build
- Introducing the Java Development Tools Bug Dataset
- Blueprint: Building a Text Classification System
- Final Blueprint for Text Classification
- Blueprint: Using Cross-Validation to Estimate Realistic Accuracy Metrics
- Blueprint: Performing Hyperparameter Tuning with Grid Search
- Blueprint Recap and Conclusion
- Closing Remarks
- Further Reading
- 7. How to Explain a Text Classifier
- What You'll Learn and What We'll Build
- Blueprint: Determining Classification Confidence Using Prediction Probability
- Blueprint: Measuring Feature Importance of Predictive Models
- Blueprint: Using LIME to Explain the Classification Results
- Blueprint: Using ELI5 to Explain the Classification Results
- Blueprint: Using Anchor to Explain the Classification Results
- Closing Remarks
- 8. Unsupervised Methods: Topic Modeling and Clustering
- What You'll Learn and What We'll Build
- Our Dataset: UN General Debates
- Nonnegative Matrix Factorization (NMF)
- Latent Semantic Analysis/Indexing
- Latent Dirichlet Allocation
- Blueprint: Using Word Clouds to Display and Compare Topic Models
- Blueprint: Calculating Topic Distribution of Documents and Time Evolution
- Using Gensim for Topic Modeling
- Blueprint: Using Clustering to Uncover the Structure of Text Data
- Further Ideas
- Summary and Recommendation
- Conclusion
- 9. Text Summarization
- What You'll Learn and What We'll Build
- Text Summarization
- Blueprint: Summarizing Text Using Topic Representation
- Blueprint: Summarizing Text Using an Indicator Representation
- Measuring the Performance of Text Summarization Methods
- Blueprint: Summarizing Text Using Machine Learning
- Closing Remarks
- Further Reading
- 10. Exploring Semantic Relationships with Word Embeddings
- 11. Performing Sentiment Analysis on Text Data
- What You'll Learn and What We'll Build
- Sentiment Analysis
- Introducing the Amazon Customer Reviews Dataset
- Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches
- Supervised Learning Approaches
- Blueprint: Vectorizing Text Data and Applying a Supervised Machine Learning Algorithm
- Pretrained Language Models Using Deep Learning
- Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model
- Closing Remarks
- Further Reading
- 12. Building a Knowledge Graph
- 13. Using Text Analytics in Production
- What You'll Learn and What We'll Build
- Blueprint: Using Conda to Create Reproducible Python Environments
- Blueprint: Using Containers to Create Reproducible Environments
- Blueprint: Creating a REST API for Your Text Analytics Model
- Blueprint: Deploying and Scaling Your API Using a Cloud Provider
- Blueprint: Automatically Versioning and Deploying Builds
- Closing Remarks
- Further Reading
- Index
Product information
- Title: Blueprints for Text Analytics Using Python
- Author(s): Jens Albrecht, Sidharth Ramachandran, Christian Winkler
- Release date: December 2020
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492074038