Machine Learning with Python Cookbook, 2nd Edition

Book description

This practical guide provides more than 200 self-contained recipes to help you solve machine learning challenges you may encounter in your work. If you're comfortable with Python and its libraries, including pandas and scikit-learn, you'll be able to address specific problems, from loading data to training models and leveraging neural networks.

Each recipe in this updated edition includes code that you can copy, paste, and run with a toy dataset to ensure that it works. From there, you can adapt these recipes according to your use case or application. Recipes include a discussion that explains the solution and provides meaningful context.

Go beyond theory and concepts by learning the nuts and bolts you need to construct working machine learning applications. You'll find recipes for:

  • Vectors, matrices, and arrays
  • Working with data from CSV, JSON, SQL, databases, cloud storage, and other sources
  • Handling numerical and categorical data, text, images, and dates and times
  • Dimensionality reduction using feature extraction or feature selection
  • Model evaluation and selection
  • Linear and logistic regression, trees and forests, and k-nearest neighbors
  • Support vector machines (SVM), naïve Bayes, clustering, and tree-based models
  • Saving, loading, and serving trained models from multiple frameworks

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  2. 1. Working with Vectors, Matrices, and Arrays in NumPy
    1. 1.0. Introduction
    2. 1.1. Creating a Vector
    3. 1.2. Creating a Matrix
    4. 1.3. Creating a Sparse Matrix
    5. 1.4. Preallocating NumPy Arrays
    6. 1.5. Selecting Elements
    7. 1.6. Describing a Matrix
    8. 1.7. Applying Functions over Each Element
    9. 1.8. Finding the Maximum and Minimum Values
    10. 1.9. Calculating the Average, Variance, and Standard Deviation
    11. 1.10. Reshaping Arrays
    12. 1.11. Transposing a Vector or Matrix
    13. 1.12. Flattening a Matrix
    14. 1.13. Finding the Rank of a Matrix
    15. 1.14. Getting the Diagonal of a Matrix
    16. 1.15. Calculating the Trace of a Matrix
    17. 1.16. Calculating Dot Products
    18. 1.17. Adding and Subtracting Matrices
    19. 1.18. Multiplying Matrices
    20. 1.19. Inverting a Matrix
    21. 1.20. Generating Random Values
  3. 2. Loading Data
    1. 2.0. Introduction
    2. 2.1. Loading a Sample Dataset
    3. 2.2. Creating a Simulated Dataset
    4. 2.3. Loading a CSV File
    5. 2.4. Loading an Excel File
    6. 2.5. Loading a JSON File
    7. 2.6. Loading a Parquet File
    8. 2.7. Loading an Avro File
    9. 2.8. Querying a SQLite Database
    10. 2.9. Querying a Remote SQL Database
    11. 2.10. Loading Data from a Google Sheet
    12. 2.11. Loading Data from an S3 Bucket
    13. 2.12. Loading Unstructured Data
  4. 3. Data Wrangling
    1. 3.0. Introduction
    1. 3.1. Creating a DataFrame
    3. 3.2. Getting Information about the Data
    4. 3.3. Slicing DataFrames
    5. 3.4. Selecting Rows Based on Conditionals
    6. 3.5. Sorting Values
    7. 3.6. Replacing Values
    8. 3.7. Renaming Columns
    9. 3.8. Finding the Minimum, Maximum, Sum, Average, and Count
    10. 3.9. Finding Unique Values
    11. 3.10. Handling Missing Values
    12. 3.11. Deleting a Column
    13. 3.12. Deleting a Row
    14. 3.13. Dropping Duplicate Rows
    15. 3.14. Grouping Rows by Values
    16. 3.15. Grouping Rows by Time
    17. 3.16. Aggregating Operations and Statistics
    18. 3.17. Looping over a Column
    19. 3.18. Applying a Function over All Elements in a Column
    20. 3.19. Applying a Function to Groups
    21. 3.20. Concatenating DataFrames
    22. 3.21. Merging DataFrames
  5. 4. Handling Numerical Data
    1. 4.0. Introduction
    2. 4.1. Rescaling a Feature
    3. 4.2. Standardizing a Feature
    4. 4.3. Normalizing Observations
    5. 4.4. Generating Polynomial and Interaction Features
    6. 4.5. Transforming Features
    7. 4.6. Detecting Outliers
    8. 4.7. Handling Outliers
    9. 4.8. Discretizing Features
    10. 4.9. Grouping Observations Using Clustering
    11. 4.10. Deleting Observations with Missing Values
    12. 4.11. Imputing Missing Values
  6. 5. Handling Categorical Data
    1. 5.0. Introduction
    2. 5.1. Encoding Nominal Categorical Features
    3. 5.2. Encoding Ordinal Categorical Features
    4. 5.3. Encoding Dictionaries of Features
    5. 5.4. Imputing Missing Class Values
    6. 5.5. Handling Imbalanced Classes
  7. 6. Handling Text
    1. 6.0. Introduction
    2. 6.1. Cleaning Text
    3. 6.2. Parsing and Cleaning HTML
    4. 6.3. Removing Punctuation
    5. 6.4. Tokenizing Text
    6. 6.5. Removing Stop Words
    7. 6.6. Stemming Words
    8. 6.7. Tagging Parts of Speech
    9. 6.8. Performing Named-Entity Recognition
    10. 6.9. Encoding Text as a Bag of Words
    11. 6.10. Weighting Word Importance
    12. 6.11. Using Text Vectors to Calculate Text Similarity in a Search Query
    13. 6.12. Using a Sentiment Analysis Classifier
  8. 7. Handling Dates and Times
    1. 7.0. Introduction
    2. 7.1. Converting Strings to Dates
    3. 7.2. Handling Time Zones
    4. 7.3. Selecting Dates and Times
    5. 7.4. Breaking Up Date Data into Multiple Features
    6. 7.5. Calculating the Difference Between Dates
    7. 7.6. Encoding Days of the Week
    8. 7.7. Creating a Lagged Feature
    9. 7.8. Using Rolling Time Windows
    10. 7.9. Handling Missing Data in Time Series
  9. 8. Handling Images
    1. 8.0. Introduction
    2. 8.1. Loading Images
    3. 8.2. Saving Images
    4. 8.3. Resizing Images
    5. 8.4. Cropping Images
    6. 8.5. Blurring Images
    7. 8.6. Sharpening Images
    8. 8.7. Enhancing Contrast
    9. 8.8. Isolating Colors
    10. 8.9. Binarizing Images
    11. 8.10. Removing Backgrounds
    12. 8.11. Detecting Edges
    13. 8.12. Detecting Corners
    14. 8.13. Creating Features for Machine Learning
    15. 8.14. Encoding Color Histograms as Features
    16. 8.15. Using Pretrained Embeddings as Features
    17. 8.16. Detecting Objects with OpenCV
    18. 8.17. Classifying Images with PyTorch
  10. 9. Dimensionality Reduction Using Feature Extraction
    1. 9.0. Introduction
    2. 9.1. Reducing Features Using Principal Components
    3. 9.2. Reducing Features When Data Is Linearly Inseparable
    4. 9.3. Reducing Features by Maximizing Class Separability
    5. 9.4. Reducing Features Using Matrix Factorization
    6. 9.5. Reducing Features on Sparse Data
  11. 10. Dimensionality Reduction Using Feature Selection
    1. 10.0. Introduction
    2. 10.1. Thresholding Numerical Feature Variance
    3. 10.2. Thresholding Binary Feature Variance
    4. 10.3. Handling Highly Correlated Features
    5. 10.4. Removing Irrelevant Features for Classification
    6. 10.5. Recursively Eliminating Features
  12. 11. Model Evaluation
    1. 11.0. Introduction
    2. 11.1. Cross-Validating Models
    3. 11.2. Creating a Baseline Regression Model
    4. 11.3. Creating a Baseline Classification Model
    5. 11.4. Evaluating Binary Classifier Predictions
    6. 11.5. Evaluating Binary Classifier Thresholds
    7. 11.6. Evaluating Multiclass Classifier Predictions
    8. 11.7. Visualizing a Classifier’s Performance
    9. 11.8. Evaluating Regression Models
    10. 11.9. Evaluating Clustering Models
    11. 11.10. Creating a Custom Evaluation Metric
    12. 11.11. Visualizing the Effect of Training Set Size
    13. 11.12. Creating a Text Report of Evaluation Metrics
    14. 11.13. Visualizing the Effect of Hyperparameter Values
  13. 12. Model Selection
    1. 12.0. Introduction
    2. 12.1. Selecting the Best Models Using Exhaustive Search
    3. 12.2. Selecting the Best Models Using Randomized Search
    4. 12.3. Selecting the Best Models from Multiple Learning Algorithms
    5. 12.4. Selecting the Best Models When Preprocessing
    6. 12.5. Speeding Up Model Selection with Parallelization
    7. 12.6. Speeding Up Model Selection Using Algorithm-Specific Methods
    8. 12.7. Evaluating Performance After Model Selection
  14. 13. Linear Regression
    1. 13.0. Introduction
    2. 13.1. Fitting a Line
    3. 13.2. Handling Interactive Effects
    4. 13.3. Fitting a Nonlinear Relationship
    5. 13.4. Reducing Variance with Regularization
    6. 13.5. Reducing Features with Lasso Regression
  15. 14. Trees and Forests
    1. 14.0. Introduction
    2. 14.1. Training a Decision Tree Classifier
    3. 14.2. Training a Decision Tree Regressor
    4. 14.3. Visualizing a Decision Tree Model
    5. 14.4. Training a Random Forest Classifier
    6. 14.5. Training a Random Forest Regressor
    7. 14.6. Evaluating Random Forests with Out-of-Bag Errors
    8. 14.7. Identifying Important Features in Random Forests
    9. 14.8. Selecting Important Features in Random Forests
    10. 14.9. Handling Imbalanced Classes
    11. 14.10. Controlling Tree Size
    12. 14.11. Improving Performance Through Boosting
    13. 14.12. Training an XGBoost Model
    14. 14.13. Improving Real-Time Performance with LightGBM
  16. 15. K-Nearest Neighbors
    1. 15.0. Introduction
    2. 15.1. Finding an Observation’s Nearest Neighbors
    3. 15.2. Creating a K-Nearest Neighbors Classifier
    4. 15.3. Identifying the Best Neighborhood Size
    5. 15.4. Creating a Radius-Based Nearest Neighbors Classifier
    6. 15.5. Finding Approximate Nearest Neighbors
    7. 15.6. Evaluating Approximate Nearest Neighbors
  17. 16. Logistic Regression
    1. 16.0. Introduction
    2. 16.1. Training a Binary Classifier
    3. 16.2. Training a Multiclass Classifier
    4. 16.3. Reducing Variance Through Regularization
    5. 16.4. Training a Classifier on Very Large Data
    6. 16.5. Handling Imbalanced Classes
  18. 17. Support Vector Machines
    1. 17.0. Introduction
    2. 17.1. Training a Linear Classifier
    3. 17.2. Handling Linearly Inseparable Classes Using Kernels
    4. 17.3. Creating Predicted Probabilities
    5. 17.4. Identifying Support Vectors
    6. 17.5. Handling Imbalanced Classes
  19. 18. Naive Bayes
    1. 18.0. Introduction
    2. 18.1. Training a Classifier for Continuous Features
    3. 18.2. Training a Classifier for Discrete and Count Features
    4. 18.3. Training a Naive Bayes Classifier for Binary Features
    5. 18.4. Calibrating Predicted Probabilities
  20. 19. Clustering
    1. 19.0. Introduction
    2. 19.1. Clustering Using K-Means
    3. 19.2. Speeding Up K-Means Clustering
    4. 19.3. Clustering Using Mean Shift
    5. 19.4. Clustering Using DBSCAN
    6. 19.5. Clustering Using Hierarchical Merging
  21. 20. Tensors with PyTorch
    1. 20.0. Introduction
    2. 20.1. Creating a Tensor
    3. 20.2. Creating a Tensor from NumPy
    4. 20.3. Creating a Sparse Tensor
    5. 20.4. Selecting Elements in a Tensor
    6. 20.5. Describing a Tensor
    7. 20.6. Applying Operations to Elements
    8. 20.7. Finding the Maximum and Minimum Values
    9. 20.8. Reshaping Tensors
    10. 20.9. Transposing a Tensor
    11. 20.10. Flattening a Tensor
    12. 20.11. Calculating Dot Products
    13. 20.12. Multiplying Tensors
  22. 21. Neural Networks
    1. 21.0. Introduction
    2. 21.1. Using Autograd with PyTorch
    3. 21.2. Preprocessing Data for Neural Networks
    4. 21.3. Designing a Neural Network
    5. 21.4. Training a Binary Classifier
    6. 21.5. Training a Multiclass Classifier
    7. 21.6. Training a Regressor
    8. 21.7. Making Predictions
    9. 21.8. Visualizing Training History
    10. 21.9. Reducing Overfitting with Weight Regularization
    11. 21.10. Reducing Overfitting with Early Stopping
    12. 21.11. Reducing Overfitting with Dropout
    13. 21.12. Saving Model Training Progress
    14. 21.13. Tuning Neural Networks
    15. 21.14. Visualizing Neural Networks
  23. 22. Neural Networks for Unstructured Data
    1. 22.0. Introduction
    2. 22.1. Training a Neural Network for Image Classification
    3. 22.2. Training a Neural Network for Text Classification
    4. 22.3. Fine-Tuning a Pretrained Model for Image Classification
    5. 22.4. Fine-Tuning a Pretrained Model for Text Classification
  24. 23. Saving, Loading, and Serving Trained Models
    1. 23.0. Introduction
    2. 23.1. Saving and Loading a scikit-learn Model
    3. 23.2. Saving and Loading a TensorFlow Model
    4. 23.3. Saving and Loading a PyTorch Model
    5. 23.4. Serving scikit-learn Models
    6. 23.5. Serving TensorFlow Models
    7. 23.6. Serving PyTorch Models in Seldon
  25. Index
  26. About the Authors

Product information

  • Title: Machine Learning with Python Cookbook, 2nd Edition
  • Author(s): Kyle Gallatin, Chris Albon
  • Release date: August 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098135720