The Art of Machine Learning

Book description

Machine learning without advanced math! This book presents a serious, practical look at machine learning, preparing you to draw valuable insights from your own data. The Art of Machine Learning is packed with real dataset examples and sophisticated advice on how to make full use of powerful machine learning methods. Readers will need only an intuitive grasp of charts, graphs, and the slope of a line, as well as familiarity with the R programming language. You’ll become skilled in a range of machine learning methods, starting with the simple k-Nearest Neighbors method (k-NN), then moving on to random forests, gradient boosting, linear/logistic models, support vector machines, the LASSO, and neural networks. Final chapters introduce text and image classification, as well as time series. You’ll learn not only how to use machine learning methods, but also why these methods work, providing the strong foundational background you’ll need in practice. Additional features:

  • How to avoid common problems, such as dealing with “dirty” data and factor variables with large numbers of levels
  • A look at typical misconceptions, such as dealing with unbalanced data
  • Exploration of the famous Bias-Variance Tradeoff, central to machine learning, and how it plays out in practice for each machine learning method
  • Dozens of illustrative examples involving real datasets of varying size and field of application
  • Standard R packages used throughout, with a simple wrapper interface (the qe*-series) providing convenient access, as sketched below

After finishing this book, you will be well equipped to start applying machine learning techniques to your own datasets.
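
To give a feel for that wrapper interface, here is a minimal sketch of a typical qe*-series call in R. It assumes the author's qeML and regtools packages; the dataset, column name, and return component shown are assumptions based on those packages, not excerpts from the book.

    # Rough sketch of the qe*-series call pattern (assumption: the qeML and
    # regtools packages are installed; details may differ from the book).
    library(qeML)
    data(mlb, package='regtools')   # baseball player data (see Section 1.8)
    fit <- qeKNN(mlb, 'Weight')     # k-NN prediction of Weight from the other columns;
                                    # a holdout set is formed automatically
    fit$testAcc                     # prediction error on that holdout set

The same call pattern (data frame first, name of the outcome column second) carries over to the other qe*-series functions listed in the table of contents, such as qeRF() and qeLogit().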

Table of contents

  1. Cover Page
  2. Title Page
  3. Copyright Page
  4. About the Author
  5. About the Technical Reviewer
  6. BRIEF CONTENTS
  7. CONTENTS IN DETAIL
  8. ACKNOWLEDGMENTS
  9. INTRODUCTION
    1. 0.1 What Is ML?
    2. 0.2 The Role of Math in ML Theory and Practice
    3. 0.3 Why Another ML Book?
    4. 0.4 Recurring Special Sections
    5. 0.5 Background Needed
    6. 0.6 The qe*-Series Software
    7. 0.7 The Book’s Grand Plan
    8. 0.8 One More Point
  10. PART I: PROLOGUE, AND NEIGHBORHOOD-BASED METHODS
  11. 1 REGRESSION MODELS
    1. 1.1 Example: The Bike Sharing Dataset
      1. 1.1.1 Loading the Data
      2. 1.1.2 A Look Ahead
    2. 1.2 Machine Learning and Prediction
      1. 1.2.1 Predicting Past, Present, and Future
      2. 1.2.2 Statistics vs. Machine Learning in Prediction
    3. 1.3 Introducing the k-Nearest Neighbors Method
      1. 1.3.1 Predicting Bike Ridership with k-NN
    4. 1.4 Dummy Variables and Categorical Variables
    5. 1.5 Analysis with qeKNN()
      1. 1.5.1 Predicting Bike Ridership with qeKNN()
    6. 1.6 The Regression Function: The Basis of ML
    7. 1.7 The Bias-Variance Trade-off
      1. 1.7.1 Analogy to Election Polls
      2. 1.7.2 Back to ML
    8. 1.8 Example: The mlb Dataset
    9. 1.9 k-NN and Categorical Features
    10. 1.10 Scaling
    11. 1.11 Choosing Hyperparameters
      1. 1.11.1 Predicting the Training Data
    12. 1.12 Holdout Sets
      1. 1.12.1 Loss Functions
      2. 1.12.2 Holdout Sets in the qe*-Series
      3. 1.12.3 Motivating Cross-Validation
      4. 1.12.4 Hyperparameters, Dataset Size, and Number of Features
    13. 1.13 Pitfall: p-Hacking and Hyperparameter Selection
    14. 1.14 Pitfall: Long-Term Time Trends
    15. 1.15 Pitfall: Dirty Data
    16. 1.16 Pitfall: Missing Data
    17. 1.17 Direct Access to the regtools k-NN Code
    18. 1.18 Conclusions
  12. 2 CLASSIFICATION MODELS
    1. 2.1 Classification Is a Special Case of Regression
    2. 2.2 Example: The Telco Churn Dataset
      1. 2.2.1 Pitfall: Factor Data Read as Non-factor
      2. 2.2.2 Pitfall: Retaining Useless Features
      3. 2.2.3 Dealing with NA Values
      4. 2.2.4 Applying the k-Nearest Neighbors Method
      5. 2.2.5 Pitfall: Overfitting Due to Features with Many Categories
    3. 2.3 Example: Vertebrae Data
      1. 2.3.1 Analysis
    4. 2.4 Pitfall: Error Rate Improves Only Slightly Using the Features
    5. 2.5 The Confusion Matrix
    6. 2.6 Clearing the Confusion: Unbalanced Data
      1. 2.6.1 Example: The Kaggle Appointments Dataset
      2. 2.6.2 A Better Approach to Unbalanced Data
    7. 2.7 Receiver Operating Characteristic and Area Under Curve
      1. 2.7.1 Details of ROC and AUC
      2. 2.7.2 The qeROC() Function
      3. 2.7.3 Example: Telco Churn Data
      4. 2.7.4 Example: Vertebrae Data
      5. 2.7.5 Pitfall: Overreliance on AUC
    8. 2.8 Conclusions
  13. 3 BIAS, VARIANCE, OVERFITTING, AND CROSS-VALIDATION
    1. 3.1 Overfitting and Underfitting
      1. 3.1.1 Intuition Regarding the Number of Features and Overfitting
      2. 3.1.2 Relation to Overall Dataset Size
      3. 3.1.3 Well Then, What Are the Best Values of k and p?
    2. 3.2 Cross-Validation
      1. 3.2.1 K-Fold Cross-Validation
      2. 3.2.2 Using the replicMeans() Function
      3. 3.2.3 Example: Programmer and Engineer Data
      4. 3.2.4 Triple Cross-Validation
    3. 3.3 Conclusions
  14. 4 DEALING WITH LARGE NUMBERS OF FEATURES
    1. 4.1 Pitfall: Computational Issues in Large Datasets
    2. 4.2 Introduction to Dimension Reduction
      1. 4.2.1 Example: The Million Song Dataset
      2. 4.2.2 The Need for Dimension Reduction
    3. 4.3 Methods for Dimension Reduction
      1. 4.3.1 Consolidation and Embedding
      2. 4.3.2 The All Possible Subsets Method
      3. 4.3.3 Principal Components Analysis
      4. 4.3.4 But Now We Have Two Hyperparameters
      5. 4.3.5 Using the qePCA() Wrapper
      6. 4.3.6 PCs and the Bias-Variance Trade-off
    4. 4.4 The Curse of Dimensionality
    5. 4.5 Other Methods of Dimension Reduction
      1. 4.5.1 Feature Ordering by Conditional Independence
      2. 4.5.2 Uniform Manifold Approximation and Projection
    6. 4.6 Going Further Computationally
    7. 4.7 Conclusions
  15. PART II: TREE-BASED METHODS
  16. 5 A STEP BEYOND K-NN: DECISION TREES
    1. 5.1 Basics of Decision Trees
    2. 5.2 The qeDT() Function
      1. 5.2.1 Looking at the Plot
    3. 5.3 Example: New York City Taxi Data
      1. 5.3.1 Pitfall: Too Many Combinations of Factor Levels
      2. 5.3.2 Tree-Based Analysis
    4. 5.4 Example: Forest Cover Data
    5. 5.5 Decision Tree Hyperparameters: How to Split?
    6. 5.6 Hyperparameters in the qeDT() Function
    7. 5.7 Conclusions
  17. 6 TWEAKING THE TREES
    1. 6.1 Bias vs. Variance, Bagging, and Boosting
    2. 6.2 Bagging: Generating New Trees by Resampling
      1. 6.2.1 Random Forests
      2. 6.2.2 The qeRF() Function
      3. 6.2.3 Example: Vertebrae Data
      4. 6.2.4 Example: Remote-Sensing Soil Analysis
    3. 6.3 Boosting: Repeatedly Tweaking a Tree
      1. 6.3.1 Implementation: AdaBoost
      2. 6.3.2 Gradient Boosting
      3. 6.3.3 Example: Call Network Monitoring
      4. 6.3.4 Example: Vertebrae Data
      5. 6.3.5 Bias vs. Variance in Boosting
      6. 6.3.6 Computational Speed
      7. 6.3.7 Further Hyperparameters
      8. 6.3.8 The Learning Rate
    4. 6.4 Pitfall: No Free Lunch
  18. 7 FINDING A GOOD SET OF HYPERPARAMETERS
    1. 7.1 Combinations of Hyperparameters
    2. 7.2 Grid Searching with qeFT()
      1. 7.2.1 How to Call qeFT()
    3. 7.3 Example: Programmer and Engineer Data
      1. 7.3.1 Confidence Intervals
      2. 7.3.2 The Takeaway on Grid Searching
    4. 7.4 Example: Programmer and Engineer Data
    5. 7.5 Example: Phoneme Data
    6. 7.6 Conclusions
  19. PART III: METHODS BASED ON LINEAR RELATIONSHIPS
  20. 8 PARAMETRIC METHODS
    1. 8.1 Motivating Example: The Baseball Player Data
      1. 8.1.1 A Graph to Guide Our Intuition
      2. 8.1.2 View as Dimension Reduction
    2. 8.2 The lm() Function
    3. 8.3 Wrapper for lm() in the qe*-Series: qeLin()
    4. 8.4 Use of Multiple Features
      1. 8.4.1 Example: Baseball Player, Continued
      2. 8.4.2 Beta Notation
      3. 8.4.3 Example: Airbnb Data
      4. 8.4.4 Applying the Linear Model
    5. 8.5 Dimension Reduction
      1. 8.5.1 Which Features Are Important?
      2. 8.5.2 Statistical Significance and Dimension Reduction
    6. 8.6 Least Squares and Residuals
    7. 8.7 Diagnostics: Is the Linear Model Valid?
      1. 8.7.1 Exactness?
      2. 8.7.2 Diagnostic Methods
    8. 8.8 The R-Squared Value(s)
    9. 8.9 Classification Applications: The Logistic Model
      1. 8.9.1 The glm() and qeLogit() Functions
      2. 8.9.2 Example: Telco Churn Data
      3. 8.9.3 Multiclass Case
      4. 8.9.4 Example: Fall Detection Data
    10. 8.10 Bias and Variance in Linear/Generalized Linear Models
      1. 8.10.1 Example: Bike Sharing Data
    11. 8.11 Polynomial Models
      1. 8.11.1 Motivation
      2. 8.11.2 Modeling Nonlinearity with a Linear Model
      3. 8.11.3 Polynomial Logistic Regression
      4. 8.11.4 Example: Programmer and Engineer Wages
    12. 8.12 Blending the Linear Model with Other Methods
    13. 8.13 The qeCompare() Function
      1. 8.13.1 Need for Caution Regarding Polynomial Models
    14. 8.14 What’s Next
  21. 9 CUTTING THINGS DOWN TO SIZE: REGULARIZATION
    1. 9.1 Motivation
    2. 9.2 Size of a Vector
    3. 9.3 Ridge Regression and the LASSO
      1. 9.3.1 How They Work
      2. 9.3.2 The Bias-Variance Trade-off, Avoiding Overfitting
      3. 9.3.3 Relation Between λ, n, and p
      4. 9.3.4 Comparison, Ridge vs. LASSO
    4. 9.4 Software
    5. 9.5 Example: NYC Taxi Data
    6. 9.6 Example: Airbnb Data
    7. 9.7 Example: African Soil Data
      1. 9.7.1 LASSO Analysis
    8. 9.8 Optional Section: The Famous LASSO Picture
    9. 9.9 Coming Up
  22. PART IV: METHODS BASED ON SEPARATING LINES AND PLANES
  23. 10 A BOUNDARY APPROACH: SUPPORT VECTOR MACHINES
    1. 10.1 Motivation
      1. 10.1.1 Example: The Forest Cover Dataset
    2. 10.2 Lines, Planes, and Hyperplanes
    3. 10.3 Math Notation
      1. 10.3.1 Vector Expressions
      2. 10.3.2 Dot Products
      3. 10.3.3 SVM as a Parametric Model
    4. 10.4 SVM: The Basic Ideas—Separable Case
      1. 10.4.1 Example: The Anderson Iris Dataset
      2. 10.4.2 Optimizing Criterion
    5. 10.5 Major Problem: Lack of Linear Separability
      1. 10.5.1 Applying a “Kernel”
      2. 10.5.2 Soft Margin
    6. 10.6 Example: Forest Cover Data
    7. 10.7 And What About That Kernel Trick?
    8. 10.8 “Warning: Maximum Number of Iterations Reached”
    9. 10.9 Summary
  24. 11 LINEAR MODELS ON STEROIDS: NEURAL NETWORKS
    1. 11.1 Overview
    2. 11.2 Working on Top of a Complex Infrastructure
    3. 11.3 Example: Vertebrae Data
    4. 11.4 Neural Network Hyperparameters
    5. 11.5 Activation Functions
    6. 11.6 Regularization
      1. 11.6.1 L1 and L2 Regularization
      2. 11.6.2 Regularization by Dropout
    7. 11.7 Example: Fall Detection Data
    8. 11.8 Pitfall: Convergence Problems
    9. 11.9 Close Relation to Polynomial Regression
    10. 11.10 Bias vs. Variance in Neural Networks
    11. 11.11 Discussion
  25. PART V: APPLICATIONS
  26. 12 IMAGE CLASSIFICATION
    1. 12.1 Example: The Fashion MNIST Data
      1. 12.1.1 A First Try Using a Logit Model
      2. 12.1.2 Refinement via PCA
    2. 12.2 Convolutional Models
      1. 12.2.1 Need for Recognition of Locality
      2. 12.2.2 Overview of Convolutional Methods
      3. 12.2.3 Image Tiling
      4. 12.2.4 The Convolution Operation
      5. 12.2.5 The Pooling Operation
      6. 12.2.6 Shape Evolution Across Layers
      7. 12.2.7 Dropout
      8. 12.2.8 Summary of Shape Evolution
      9. 12.2.9 Translation Invariance
    3. 12.3 Tricks of the Trade
      1. 12.3.1 Data Augmentation
      2. 12.3.2 Pretrained Networks
    4. 12.4 So, What About the Overfitting Issue?
    5. 12.5 Conclusions
  27. 13 HANDLING TIME SERIES AND TEXT DATA
    1. 13.1 Converting Time Series Data to Rectangular Form
      1. 13.1.1 Toy Example
      2. 13.1.2 The regtools Function TStoX()
    2. 13.2 The qeTS() Function
    3. 13.3 Example: Weather Data
    4. 13.4 Bias vs. Variance
    5. 13.5 Text Applications
      1. 13.5.1 The Bag-of-Words Model
      2. 13.5.2 The qeText() Function
      3. 13.5.3 Example: Quiz Data
      4. 13.5.4 Example: AG News Dataset
    6. 13.6 Summary
  28. A LIST OF ACRONYMS AND SYMBOLS
  29. B STATISTICS AND ML TERMINOLOGY CORRESPONDENCE
  30. C MATRICES, DATA FRAMES, AND FACTOR CONVERSIONS
    1. C.1 Matrices
    2. C.2 Conversions: Between R Factors and Dummy Variables, Between Data Frames and Matrices
  31. D PITFALL: BEWARE OF “P-HACKING”!
  32. INDEX

Product information

  • Title: The Art of Machine Learning
  • Author(s): Norman Matloff
  • Release date: January 2024
  • Publisher(s): No Starch Press
  • ISBN: 9781718502109