Machine Learning with R - Fourth Edition

Book description

Use R and the tidyverse to prepare, clean, import, visualize, transform, program, communicate, predict, and model data. No R experience is required, although prior exposure to statistics and programming is helpful. Purchase of the print or Kindle book includes a free eBook in PDF format.

Key Features

  • Get to grips with the tidyverse, challenging data, and big data
  • Create clear and concise data and model visualizations that effectively communicate results to stakeholders
  • Solve a variety of problems using regression, ensemble methods, clustering, deep learning, probabilistic models, and more

Book Description

Dive into R with this data science guide on machine learning (ML). Machine Learning with R, Fourth Edition, takes you through classification methods such as nearest neighbor and Naive Bayes, as well as regression modeling, from simple linear to logistic.

Explore practical deep learning with neural networks, apply support vector machines, and unearth valuable insights from complex datasets with market basket analysis. Learn how to uncover hidden patterns in your data using k-means clustering.

With three new chapters on data, you’ll hone your skills in advanced data preparation, master feature engineering, and tackle challenging data scenarios. This book helps you conquer high dimensionality, sparsity, and imbalanced data with confidence. Navigate the complexities of big data with ease, harnessing the power of parallel computing and leveraging GPU resources for faster insights.

Elevate your understanding of model performance evaluation, moving beyond accuracy metrics. With a new chapter on building better learners, you’ll pick up techniques that top teams use to improve model performance with ensemble methods and innovative model stacking and blending techniques.

Machine Learning with R, Fourth Edition, equips you with the tools and knowledge to tackle even the most formidable data challenges. Unlock the full potential of machine learning and become a true master of the craft.

What you will learn

  • Understand the end-to-end process of machine learning from raw data to implementation
  • Classify important outcomes using nearest neighbor and Bayesian methods
  • Predict future events using decision trees, rules, and support vector machines
  • Forecast numeric data and estimate financial values using regression methods
  • Model complex processes with artificial neural networks
  • Prepare, transform, and clean data using the tidyverse
  • Evaluate your models and improve their performance
  • Connect R to SQL databases and emerging big data technologies such as Spark, Hadoop, H2O, and TensorFlow

Who this book is for

This book is designed to help data scientists, actuaries, data analysts, financial analysts, social scientists, business and machine learning students, and any other practitioners who want a clear, accessible guide to machine learning with R. No R experience is required, although prior exposure to statistics and programming is helpful.

Table of contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. What you need for this book
    4. Get in touch
  2. Introducing Machine Learning
    1. The origins of machine learning
    2. Uses and abuses of machine learning
      1. Machine learning successes
      2. The limits of machine learning
      3. Machine learning ethics
    3. How machines learn
      1. Data storage
      2. Abstraction
      3. Generalization
      4. Evaluation
    4. Machine learning in practice
      1. Types of input data
      2. Types of machine learning algorithms
      3. Matching input data to algorithms
    5. Machine learning with R
      1. Installing R packages
      2. Loading and unloading R packages
      3. Installing RStudio
      4. Why R and why R now?
    6. Summary
  3. Managing and Understanding Data
    1. R data structures
      1. Vectors
      2. Factors
      3. Lists
      4. Data frames
      5. Matrices and arrays
    2. Managing data with R
      1. Saving, loading, and removing R data structures
      2. Importing and saving datasets from CSV files
      3. Importing common dataset formats using RStudio
    3. Exploring and understanding data
      1. Exploring the structure of data
      2. Exploring numeric features
        1. Measuring the central tendency – mean and median
        2. Measuring spread – quartiles and the five-number summary
        3. Visualizing numeric features – boxplots
        4. Visualizing numeric features – histograms
        5. Understanding numeric data – uniform and normal distributions
        6. Measuring spread – variance and standard deviation
      3. Exploring categorical features
        1. Measuring the central tendency – the mode
      4. Exploring relationships between features
        1. Visualizing relationships – scatterplots
        2. Examining relationships – two-way cross-tabulations
    4. Summary
  4. Lazy Learning – Classification Using Nearest Neighbors
    1. Understanding nearest neighbor classification
      1. The k-NN algorithm
        1. Measuring similarity with distance
        2. Choosing an appropriate k
        3. Preparing data for use with k-NN
      2. Why is the k-NN algorithm lazy?
    2. Example – diagnosing breast cancer with the k-NN algorithm
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
        1. Transformation – normalizing numeric data
        2. Data preparation – creating training and test datasets
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
        1. Transformation – z-score standardization
        2. Testing alternative values of k
    3. Summary
  5. Probabilistic Learning – Classification Using Naive Bayes
    1. Understanding Naive Bayes
      1. Basic concepts of Bayesian methods
        1. Understanding probability
        2. Understanding joint probability
        3. Computing conditional probability with Bayes’ theorem
      2. The Naive Bayes algorithm
        1. Classification with Naive Bayes
        2. The Laplace estimator
        3. Using numeric features with Naive Bayes
    2. Example – filtering mobile phone spam with the Naive Bayes algorithm
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
        1. Data preparation – cleaning and standardizing text data
        2. Data preparation – splitting text documents into words
        3. Data preparation – creating training and test datasets
        4. Visualizing text data – word clouds
        5. Data preparation – creating indicator features for frequent words
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
    3. Summary
  6. Divide and Conquer – Classification Using Decision Trees and Rules
    1. Understanding decision trees
      1. Divide and conquer
      2. The C5.0 decision tree algorithm
        1. Choosing the best split
        2. Pruning the decision tree
    2. Example – identifying risky bank loans using C5.0 decision trees
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
        1. Data preparation – creating random training and test datasets
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
        1. Boosting the accuracy of decision trees
        2. Making some mistakes cost more than others
    3. Understanding classification rules
      1. Separate and conquer
      2. The 1R algorithm
      3. The RIPPER algorithm
      4. Rules from decision trees
      5. What makes trees and rules greedy?
    4. Example – identifying poisonous mushrooms with rule learners
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
    5. Summary
  7. Forecasting Numeric Data – Regression Methods
    1. Understanding regression
      1. Simple linear regression
      2. Ordinary least squares estimation
      3. Correlations
      4. Multiple linear regression
      5. Generalized linear models and logistic regression
    2. Example – predicting auto insurance claims costs using linear regression
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
        1. Exploring relationships between features – the correlation matrix
        2. Visualizing relationships between features – the scatterplot matrix
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
        1. Model specification – adding nonlinear relationships
        2. Model specification – adding interaction effects
        3. Putting it all together – an improved regression model
        4. Making predictions with a regression model
      6. Going further – predicting insurance policyholder churn with logistic regression
    3. Understanding regression trees and model trees
      1. Adding regression to trees
    4. Example – estimating the quality of wines with regression trees and model trees
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
      3. Step 3 – training a model on the data
        1. Visualizing decision trees
      4. Step 4 – evaluating model performance
        1. Measuring performance with the mean absolute error
      5. Step 5 – improving model performance
    5. Summary
  8. Black-Box Methods – Neural Networks and Support Vector Machines
    1. Understanding neural networks
      1. From biological to artificial neurons
      2. Activation functions
      3. Network topology
        1. The number of layers
        2. The direction of information travel
        3. The number of nodes in each layer
      4. Training neural networks with backpropagation
    2. Example – modeling the strength of concrete with ANNs
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
    3. Understanding support vector machines
      1. Classification with hyperplanes
        1. The case of linearly separable data
        2. The case of nonlinearly separable data
      2. Using kernels for nonlinear spaces
    4. Example – performing OCR with SVMs
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
        1. Changing the SVM kernel function
        2. Identifying the best SVM cost parameter
    5. Summary
  9. Finding Patterns – Market Basket Analysis Using Association Rules
    1. Understanding association rules
      1. The Apriori algorithm for association rule learning
      2. Measuring rule interest – support and confidence
      3. Building a set of rules with the Apriori principle
    2. Example – identifying frequently purchased groceries with association rules
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
        1. Data preparation – creating a sparse matrix for transaction data
        2. Visualizing item support – item frequency plots
        3. Visualizing the transaction data – plotting the sparse matrix
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
        1. Sorting the set of association rules
        2. Taking subsets of association rules
        3. Saving association rules to a file or data frame
        4. Using the Eclat algorithm for greater efficiency
    3. Summary
  10. Finding Groups of Data – Clustering with k-means
    1. Understanding clustering
      1. Clustering as a machine learning task
      2. Clusters of clustering algorithms
      3. The k-means clustering algorithm
        1. Using distance to assign and update clusters
        2. Choosing the appropriate number of clusters
    2. Finding teen market segments using k-means clustering
      1. Step 1 – collecting data
      2. Step 2 – exploring and preparing the data
        1. Data preparation – dummy coding missing values
        2. Data preparation – imputing the missing values
      3. Step 3 – training a model on the data
      4. Step 4 – evaluating model performance
      5. Step 5 – improving model performance
    3. Summary
  11. Evaluating Model Performance
    1. Measuring performance for classification
      1. Understanding a classifier’s predictions
      2. A closer look at confusion matrices
      3. Using confusion matrices to measure performance
      4. Beyond accuracy – other measures of performance
        1. The kappa statistic
        2. The Matthews correlation coefficient
        3. Sensitivity and specificity
        4. Precision and recall
        5. The F-measure
      5. Visualizing performance tradeoffs with ROC curves
        1. Comparing ROC curves
        2. The area under the ROC curve
        3. Creating ROC curves and computing AUC in R
    2. Estimating future performance
      1. The holdout method
      2. Cross-validation
      3. Bootstrap sampling
    3. Summary
  12. Being Successful with Machine Learning
    1. What makes a successful machine learning practitioner?
    2. What makes a successful machine learning model?
      1. Avoiding obvious predictions
      2. Conducting fair evaluations
      3. Considering real-world impacts
      4. Building trust in the model
    3. Putting the “science” in data science
      1. Using R Notebooks and R Markdown
      2. Performing advanced data exploration
        1. Constructing a data exploration roadmap
        2. Encountering outliers: a real-world pitfall
        3. Example – using ggplot2 for visual data exploration
    4. Summary
  13. Advanced Data Preparation
    1. Performing feature engineering
      1. The role of human and machine
      2. The impact of big data and deep learning
    2. Feature engineering in practice
      1. Hint 1: Brainstorm new features
      2. Hint 2: Find insights hidden in text
      3. Hint 3: Transform numeric ranges
      4. Hint 4: Observe neighbors’ behavior
      5. Hint 5: Utilize related rows
      6. Hint 6: Decompose time series
      7. Hint 7: Append external data
    3. Exploring R’s tidyverse
      1. Making tidy table structures with tibbles
      2. Reading rectangular files faster with readr and readxl
      3. Preparing and piping data with dplyr
      4. Transforming text with stringr
      5. Cleaning dates with lubridate
    4. Summary
  14. Challenging Data – Too Much, Too Little, Too Complex
    1. The challenge of high-dimension data
      1. Applying feature selection
        1. Filter methods
        2. Wrapper methods and embedded methods
        3. Example – Using stepwise regression for feature selection
        4. Example – Using Boruta for feature selection
      2. Performing feature extraction
        1. Understanding principal component analysis
        2. Example – Using PCA to reduce highly dimensional social media data
    2. Making use of sparse data
      1. Identifying sparse data
      2. Example – Remapping sparse categorical data
      3. Example – Binning sparse numeric data
    3. Handling missing data
      1. Understanding types of missing data
      2. Performing missing value imputation
        1. Simple imputation with missing value indicators
        2. Missing value patterns
    4. The problem of imbalanced data
      1. Simple strategies for rebalancing data
      2. Generating a synthetic balanced dataset with SMOTE
        1. Example – Applying the SMOTE algorithm in R
      3. Considering whether balanced is always better
    5. Summary
  15. Building Better Learners
    1. Tuning stock models for better performance
      1. Determining the scope of hyperparameter tuning
      2. Example – using caret for automated tuning
        1. Creating a simple tuned model
        2. Customizing the tuning process
    2. Improving model performance with ensembles
      1. Understanding ensemble learning
      2. Popular ensemble-based algorithms
        1. Bagging
        2. Boosting
        3. Random forests
        4. Gradient boosting
        5. Extreme gradient boosting with XGBoost
        6. Why are tree-based ensembles so popular?
    3. Stacking models for meta-learning
      1. Understanding model stacking and blending
      2. Practical methods for blending and stacking in R
    4. Summary
  16. Making Use of Big Data
    1. Practical applications of deep learning
      1. Beginning with deep learning
        1. Choosing appropriate tasks for deep learning
        2. The TensorFlow and Keras deep learning frameworks
      2. Understanding convolutional neural networks
        1. Transfer learning and fine tuning
        2. Example – classifying images using a pre-trained CNN in R
    2. Unsupervised learning and big data
      1. Representing highly dimensional concepts as embeddings
        1. Understanding word embeddings
        2. Example – using word2vec for understanding text in R
      2. Visualizing highly dimensional data
        1. The limitations of using PCA for big data visualization
        2. Understanding the t-SNE algorithm
        3. Example – visualizing data’s natural clusters with t-SNE
    3. Adapting R to handle large datasets
      1. Querying data in SQL databases
        1. The tidy approach to managing database connections
        2. Using a database backend for dplyr with dbplyr
      2. Doing work faster with parallel processing
        1. Measuring R’s execution time
        2. Enabling parallel processing in R
        3. Taking advantage of parallel with foreach and doParallel
        4. Training and evaluating models in parallel with caret
      3. Utilizing specialized hardware and algorithms
        1. Parallel computing with MapReduce concepts via Apache Spark
        2. Learning via distributed and scalable algorithms with H2O
        3. GPU computing
    4. Summary
  17. Other Books You May Enjoy
  18. Index

Product information

  • Title: Machine Learning with R - Fourth Edition
  • Author(s): Brett Lantz
  • Release date: May 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781801071321