The Data Science Workshop

Book description

Cut through the noise and get real results with a step-by-step approach to data science

Key Features

  • Ideal for beginners who are just getting started with data science
  • A data science tutorial with step-by-step exercises and activities that help build key skills
  • Structured to let you progress at your own pace, on your own terms
  • Use your physical print copy to redeem free access to the online interactive edition

Book Description

You already know you want to learn data science, and the smarter way to learn is by doing. The Data Science Workshop focuses on building up your practical skills so that you can understand how to develop simple machine learning models in Python, or even build an advanced model for detecting potential bank fraud with effective, modern data science techniques. You'll learn from real examples that lead to real results.

Throughout The Data Science Workshop, you'll take an engaging step-by-step approach to understanding data science. You won't have to sit through any unnecessary theory. If you're short on time, you can jump into a single exercise each day or spend an entire weekend training a model using scikit-learn. It's your choice. Learning on your terms, you'll build up and reinforce key skills in a way that feels rewarding.

Every physical print copy of The Data Science Workshop unlocks access to the interactive edition. With videos detailing all exercises and activities, you'll always have a guided solution. You can also benchmark yourself against assessments, track progress, and receive content updates. You'll even earn a secure credential that you can share and verify online upon completion. It's a premium learning experience that's included with your printed copy. To redeem it, follow the instructions at the start of the book.

Fast-paced and direct, The Data Science Workshop is the ideal companion for data science beginners. You'll work with machine learning algorithms the way a data scientist does, learning as you go. This approach means your new skills will stick, embedded as best practice, giving you a solid foundation for the years ahead.

What you will learn

  • Find out the key differences between supervised and unsupervised learning
  • Manipulate and analyze data using the scikit-learn and pandas libraries (a short sketch follows this list)
  • Learn about different algorithms such as regression, classification, and clustering
  • Discover advanced ensembling techniques to improve model accuracy
  • Speed up the creation of new features with automated feature engineering using Featuretools
  • Simplify machine learning using open source Python packages
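
As a taste of the hands-on approach, here is a minimal sketch of the kind of pandas and scikit-learn workflow the Workshop builds up to. The dataset, model, and parameter choices below are illustrative assumptions, not taken from any specific exercise in the book:

    # Minimal sketch: load data with pandas, then train and evaluate a
    # scikit-learn model. The breast cancer dataset bundled with scikit-learn
    # is used purely for illustration, so the script runs with no extra files.
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Load the dataset into a pandas DataFrame for inspection and manipulation
    data = load_breast_cancer()
    df = pd.DataFrame(data.data, columns=data.feature_names)
    df["target"] = data.target

    # Hold out 25% of the rows for evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns="target"), df["target"], test_size=0.25, random_state=42
    )

    # Fit a random forest classifier and score it on the held-out set
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))

Each chapter of the book expands on one step of this loop: analyzing a dataset (Chapter 10), preparing data (Chapter 11), engineering features (Chapter 12), tuning hyperparameters (Chapter 8), and assessing performance (Chapter 6).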

Who this book is for

Our goal at Packt is to help you be successful in whatever you choose to do. The Data Science Workshop is an ideal tutorial for the data science beginner who is just getting started. Pick up a Workshop today and let Packt help you develop skills that stick with you for life.

Table of contents

  1. Preface
    1. About the Book
      1. About the Chapters
      2. Conventions
      3. Before You Begin
        1. How to Set Up Google Colab
        2. How to Use Google Colab
      4. Installing the Code Bundle
  2. 1. Introduction to Data Science in Python
    1. Introduction
    2. Application of Data Science
      1. What Is Machine Learning?
        1. Supervised Learning
        2. Unsupervised Learning
        3. Reinforcement Learning
    3. Overview of Python
      1. Types of Variable
        1. Numeric Variables
        2. Text Variables
        3. Python List
        4. Python Dictionary
      2. Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
    4. Python for Data Science
      1. The pandas Package
        1. DataFrame and Series
        2. CSV Files
        3. Excel Spreadsheets
        4. JSON
      2. Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
    5. Scikit-Learn
      1. What Is a Model?
        1. Model Hyperparameters
        2. The sklearn API
      2. Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
      3. Activity 1.01: Train a Spam Detector Algorithm
    6. Summary
  3. 2. Regression
    1. Introduction
    2. Simple Linear Regression
      1. The Method of Least Squares
    3. Multiple Linear Regression
      1. Estimating the Regression Coefficients (β0, β1, β2, and β3)
      2. Logarithmic Transformations of Variables
      3. Correlation Matrices
    4. Conducting Regression Analysis Using Python
      1. Exercise 2.01: Loading and Preparing the Data for Analysis
      2. The Correlation Coefficient
      3. Exercise 2.02: Graphical Investigation of Linear Relationships Using Python
      4. Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python
      5. The Statsmodels formula API
      6. Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels formula API
      7. Analyzing the Model Summary
      8. The Model Formula Language
      9. Intercept Handling
      10. Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels formula API
    5. Multiple Regression Analysis
      1. Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels formula API
    6. Assumptions of Regression Analysis
      1. Activity 2.02: Fitting a Multiple Log-Linear Regression Model
    7. Explaining the Results of Regression Analysis
      1. Regression Analysis Checks and Balances
      2. The F-test
      3. The t-test
    8. Summary
  4. 3. Binary Classification
    1. Introduction
    2. Understanding the Business Context
      1. Business Discovery
      2. Exercise 3.01: Loading and Exploring the Data from the Dataset
      3. Testing Business Hypotheses Using Exploratory Data Analysis
      4. Visualization for Exploratory Data Analysis
      5. Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan
      6. Intuitions from the Exploratory Analysis
      7. Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits
    3. Feature Engineering
      1. Business-Driven Feature Engineering
      2. Exercise 3.03: Feature Engineering – Exploration of Individual Features
      3. Exercise 3.04: Feature Engineering – Creating New Features from Existing Ones
    4. Data-Driven Feature Engineering
      1. A Quick Peek at Data Types and a Descriptive Summary
    5. Correlation Matrix and Visualization
      1. Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data
      2. Skewness of Data
      3. Histograms
      4. Density Plots
      5. Other Feature Engineering Methods
      6. Summarizing Feature Engineering
      7. Building a Binary Classification Model Using the Logistic Regression Function
      8. Logistic Regression Demystified
      9. Metrics for Evaluating Model Performance
      10. Confusion Matrix
      11. Accuracy
      12. Classification Report
      13. Data Preprocessing
      14. Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank
      15. Activity 3.02: Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables
      16. Next Steps
    6. Summary
  5. 4. Multiclass Classification with RandomForest
    1. Introduction
    2. Training a Random Forest Classifier
    3. Evaluating the Model's Performance
      1. Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance
      2. Number of Trees Estimator
      3. Exercise 4.02: Tuning n_estimators to Reduce Overfitting
    4. Maximum Depth
      1. Exercise 4.03: Tuning max_depth to Reduce Overfitting
    5. Minimum Sample in Leaf
      1. Exercise 4.04: Tuning min_samples_leaf
    6. Maximum Features
      1. Exercise 4.05: Tuning max_features
      2. Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset
    7. Summary
  6. 5. Performing Your First Cluster Analysis
    1. Introduction
    2. Clustering with k-means
      1. Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset
    3. Interpreting k-means Results
      1. Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses
    4. Choosing the Number of Clusters
      1. Exercise 5.03: Finding the Optimal Number of Clusters
    5. Initializing Clusters
      1. Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome
    6. Calculating the Distance to the Centroid
      1. Exercise 5.05: Finding the Closest Centroids in Our Dataset
    7. Standardizing Data
      1. Exercise 5.06: Standardizing the Data from Our Dataset
      2. Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means
    8. Summary
  7. 6. How to Assess Performance
    1. Introduction
    2. Splitting Data
      1. Exercise 6.01: Importing and Splitting Data
    3. Assessing Model Performance for Regression Models
      1. Data Structures – Vectors and Matrices
        1. Scalars
        2. Vectors
        3. Matrices
      2. R2 Score
      3. Exercise 6.02: Computing the R2 Score of a Linear Regression Model
      4. Mean Absolute Error
      5. Exercise 6.03: Computing the MAE of a Model
      6. Exercise 6.04: Computing the Mean Absolute Error of a Second Model
        1. Other Evaluation Metrics
    4. Assessing Model Performance for Classification Models
      1. Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics
    5. The Confusion Matrix
      1. Exercise 6.06: Generating a Confusion Matrix for the Classification Model
        1. More on the Confusion Matrix
      2. Precision
      3. Exercise 6.07: Computing Precision for the Classification Model
      4. Recall
      5. Exercise 6.08: Computing Recall for the Classification Model
      6. F1 Score
      7. Exercise 6.09: Computing the F1 Score for the Classification Model
      8. Accuracy
      9. Exercise 6.10: Computing Model Accuracy for the Classification Model
      10. Logarithmic Loss
      11. Exercise 6.11: Computing the Log Loss for the Classification Model
    6. Receiver Operating Characteristic Curve
      1. Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem
    7. Area Under the ROC Curve
      1. Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset
    8. Saving and Loading Models
      1. Exercise 6.14: Saving and Loading a Model
      2. Activity 6.01: Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model
    9. Summary
  8. 7. The Generalization of Machine Learning Models
    1. Introduction
    2. Overfitting
      1. Training on Too Many Features
      2. Training for Too Long
    3. Underfitting
    4. Data
      1. The Ratio for Dataset Splits
      2. Creating Dataset Splits
      3. Exercise 7.01: Importing and Splitting Data
    5. Random State
      1. Exercise 7.02: Setting a Random State When Splitting Data
    6. Cross-Validation
      1. KFold
      2. Exercise 7.03: Creating a Five-Fold Cross-Validation Dataset
      3. Exercise 7.04: Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls
    7. cross_val_score
      1. Exercise 7.05: Getting the Scores from Five-Fold Cross-Validation
      2. Understanding Estimators That Implement CV
    8. LogisticRegressionCV
      1. Exercise 7.06: Training a Logistic Regression Model Using Cross-Validation
    9. Hyperparameter Tuning with GridSearchCV
      1. Decision Trees
      2. Exercise 7.07: Using Grid Search with Cross-Validation to Find the Best Parameters for a Model
    10. Hyperparameter Tuning with RandomizedSearchCV
      1. Exercise 7.08: Using Randomized Search for Hyperparameter Tuning
    11. Model Regularization with Lasso Regression
      1. Exercise 7.09: Fixing Model Overfitting Using Lasso Regression
    12. Ridge Regression
      1. Exercise 7.10: Fixing Model Overfitting Using Ridge Regression
      2. Activity 7.01: Find an Optimal Model for Predicting the Critical Temperatures of Superconductors
    13. Summary
  9. 8. Hyperparameter Tuning
    1. Introduction
    2. What Are Hyperparameters?
      1. Difference between Hyperparameters and Statistical Model Parameters
      2. Setting Hyperparameters
      3. A Note on Defaults
    3. Finding the Best Hyperparameterization
      1. Exercise 8.01: Manual Hyperparameter Tuning for a k-NN Classifier
      2. Advantages and Disadvantages of a Manual Search
    4. Tuning Using Grid Search
      1. Simple Demonstration of the Grid Search Strategy
    5. GridSearchCV
      1. Tuning Using GridSearchCV
        1. Support Vector Machine (SVM) Classifiers
      2. Exercise 8.02: Grid Search Hyperparameter Tuning for an SVM
      3. Advantages and Disadvantages of Grid Search
    6. Random Search
      1. Random Variables and Their Distributions
      2. Simple Demonstration of the Random Search Process
      3. Tuning Using RandomizedSearchCV
      4. Exercise 8.03: Random Search Hyperparameter Tuning for a Random Forest Classifier
      5. Advantages and Disadvantages of a Random Search
      6. Activity 8.01: Is the Mushroom Poisonous?
    7. Summary
  10. 9. Interpreting a Machine Learning Model
    1. Introduction
    2. Linear Model Coefficients
      1. Exercise 9.01: Extracting the Linear Regression Coefficient
    3. RandomForest Variable Importance
      1. Exercise 9.02: Extracting RandomForest Feature Importance
    4. Variable Importance via Permutation
      1. Exercise 9.03: Extracting Feature Importance via Permutation
    5. Partial Dependence Plots
      1. Exercise 9.04: Plotting Partial Dependence
    6. Local Interpretation with LIME
      1. Exercise 9.05: Local Interpretation with LIME
      2. Activity 9.01: Train and Analyze a Network Intrusion Detection Model
    7. Summary
  11. 10. Analyzing a Dataset
    1. Introduction
    2. Exploring Your Data
    3. Analyzing Your Dataset
      1. Exercise 10.01: Exploring the Ames Housing Dataset with Descriptive Statistics
    4. Analyzing the Content of a Categorical Variable
      1. Exercise 10.02: Analyzing the Categorical Variables from the Ames Housing Dataset
    5. Summarizing Numerical Variables
      1. Exercise 10.03: Analyzing Numerical Variables from the Ames Housing Dataset
    6. Visualizing Your Data
      1. How to Use the Altair API
      2. Histogram for Numerical Variables
      3. Bar Chart for Categorical Variables
    7. Boxplots
      1. Exercise 10.04: Visualizing the Ames Housing Dataset with Altair
      2. Activity 10.01: Analyzing Churn Data Using Visual Data Analysis Techniques
    8. Summary
  12. 11. Data Preparation
    1. Introduction
    2. Handling Row Duplication
      1. Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset
    3. Converting Data Types
      1. Exercise 11.02: Converting Data Types for the Ames Housing Dataset
    4. Handling Incorrect Values
      1. Exercise 11.03: Fixing Incorrect Values in the State Column
    5. Handling Missing Values
      1. Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset
      2. Activity 11.01: Preparing the Speed Dating Dataset
    6. Summary
  13. 12. Feature Engineering
    1. Introduction
    2. Merging Datasets
      1. The Left Join
        1. The Right Join
      2. Exercise 12.01: Merging the ATO Dataset with the Postcode Data
    3. Binning Variables
      1. Exercise 12.02: Binning the YearBuilt Variable from the Ames Housing Dataset
    4. Manipulating Dates
      1. Exercise 12.03: Date Manipulation on Financial Services Consumer Complaints
    5. Performing Data Aggregation
      1. Exercise 12.04: Feature Engineering Using Data Aggregation on the Ames Housing Dataset
      2. Activity 12.01: Feature Engineering on a Financial Dataset
    6. Summary
  14. 13. Imbalanced Datasets
    1. Introduction
    2. Understanding the Business Context
      1. Exercise 13.01: Benchmarking the Logistic Regression Model on the Dataset
      2. Analysis of the Result
    3. Challenges of Imbalanced Datasets
    4. Strategies for Dealing with Imbalanced Datasets
      1. Collecting More Data
      2. Resampling Data
      3. Exercise 13.02: Implementing Random Undersampling and Classification on Our Banking Dataset to Find the Optimal Result
      4. Analysis
    5. Generating Synthetic Samples
      1. Implementation of SMOTE and MSMOTE
      2. Exercise 13.03: Implementing SMOTE on Our Banking Dataset to Find the Optimal Result
      3. Exercise 13.04: Implementing MSMOTE on Our Banking Dataset to Find the Optimal Result
      4. Applying Balancing Techniques on a Telecom Dataset
      5. Activity 13.01: Finding the Best Balancing Technique by Fitting a Classifier on the Telecom Churn Dataset
    6. Summary
  15. 14. Dimensionality Reduction
    1. Introduction
      1. Business Context
      2. Exercise 14.01: Loading and Cleaning the Dataset
    2. Creating a High-Dimensional Dataset
      1. Activity 14.01: Fitting a Logistic Regression Model on a High-Dimensional Dataset
    3. Strategies for Addressing High-Dimensional Datasets
      1. Backward Feature Elimination (Recursive Feature Elimination)
      2. Exercise 14.02: Dimensionality Reduction Using Backward Feature Elimination
      3. Forward Feature Selection
      4. Exercise 14.03: Dimensionality Reduction Using Forward Feature Selection
      5. Principal Component Analysis (PCA)
      6. Exercise 14.04: Dimensionality Reduction Using PCA
      7. Independent Component Analysis (ICA)
      8. Exercise 14.05: Dimensionality Reduction Using Independent Component Analysis
      9. Factor Analysis
      10. Exercise 14.06: Dimensionality Reduction Using Factor Analysis
    4. Comparing Different Dimensionality Reduction Techniques
      1. Activity 14.02: Comparison of Dimensionality Reduction Techniques on the Enhanced Ads Dataset
    5. Summary
  16. 15. Ensemble Learning
    1. Introduction
    2. Ensemble Learning
      1. Variance
      2. Bias
      3. Business Context
      4. Exercise 15.01: Loading, Exploring, and Cleaning the Data
      5. Activity 15.01: Fitting a Logistic Regression Model on Credit Card Data
    3. Simple Methods for Ensemble Learning
      1. Averaging
      2. Exercise 15.02: Ensemble Model Using the Averaging Technique
      3. Weighted Averaging
      4. Exercise 15.03: Ensemble Model Using the Weighted Averaging Technique
        1. Iteration 2 with Different Weights
        2. Max Voting
      5. Exercise 15.04: Ensemble Model Using Max Voting
      6. Advanced Techniques for Ensemble Learning
        1. Bagging
      7. Exercise 15.05: Ensemble Learning Using Bagging
      8. Boosting
      9. Exercise 15.06: Ensemble Learning Using Boosting
      10. Stacking
      11. Exercise 15.07: Ensemble Learning Using Stacking
      12. Activity 15.02: Comparison of Advanced Ensemble Techniques
    4. Summary
  17. 16. Machine Learning Pipelines
    1. Introduction
    2. Pipelines
      1. Business Context
      2. Exercise 16.01: Preparing the Dataset to Implement Pipelines
    3. Automating ML Workflows Using Pipeline
      1. Automating Data Preprocessing Using Pipelines
      2. Exercise 16.02: Applying Pipelines for Feature Extraction to the Dataset
    4. ML Pipeline with Processing and Dimensionality Reduction
      1. Exercise 16.03: Adding Dimensionality Reduction to the Feature Extraction Pipeline
    5. ML Pipeline for Modeling and Prediction
      1. Exercise 16.04: Modeling and Predictions Using ML Pipelines
    6. ML Pipeline for Spot-Checking Multiple Models
      1. Exercise 16.05: Spot-Checking Models Using ML Pipelines
    7. ML Pipelines for Identifying the Best Parameters for a Model
      1. Cross-Validation
      2. Grid Search
      3. Exercise 16.06: Grid Search and Cross-Validation with ML Pipelines
    8. Applying Pipelines to a Dataset
      1. Activity 16.01: Complete ML Workflow in a Pipeline
    9. Summary
  18. 17. Automated Feature Engineering
    1. Introduction
    2. Feature Engineering
      1. Automating Feature Engineering Using Feature Tools
      2. Business Context
      3. Domain Story for the Problem Statement
      4. Featuretools – Creating Entities and Relationships
      5. Exercise 17.01: Defining Entities and Establishing Relationships
      6. Feature Engineering – Basic Operations
      7. Featuretools – Automated Feature Engineering
      8. Exercise 17.02: Creating New Features Using Deep Feature Synthesis
      9. Exercise 17.03: Classification Model after Automated Feature Generation
    3. Featuretools on a New Dataset
      1. Activity 17.01: Building a Classification Model with Features That Have Been Generated Using Featuretools
    4. Summary

Product information

  • Title: The Data Science Workshop
  • Author(s): Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare
  • Release date: January 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781838981266