Hands-On Gradient Boosting with XGBoost and scikit-learn

Get to grips with building robust XGBoost models using Python and scikit-learn for deployment

Key Features

  • Get up and running with machine learning and understand how to boost models with XGBoost in no time
  • Build real-world machine learning pipelines and fine-tune hyperparameters to achieve optimal results (a tuning sketch follows this list)
  • Discover tips and tricks and gain innovative insights from XGBoost Kaggle winners
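
For example, the tuning workflow highlighted above can be as compact as the following minimal sketch; the synthetic dataset and parameter grid are illustrative assumptions, not material from the book:

    # Minimal sketch: fine-tuning an XGBoost classifier with GridSearchCV.
    # The synthetic dataset and parameter grid are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=2)

    params = {'max_depth': [2, 3, 4], 'learning_rate': [0.05, 0.1, 0.3]}
    grid = GridSearchCV(XGBClassifier(n_estimators=100), params, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)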

Book Description

XGBoost is an industry-proven, open-source software library that provides a gradient boosting framework capable of scaling to billions of data points quickly and efficiently.

The book introduces machine learning and XGBoost in scikit-learn before building up to the theory behind gradient boosting. You’ll cover decision trees and analyze bagging in the machine learning context, learning hyperparameters that extend to XGBoost along the way. You’ll build gradient boosting models from scratch and extend gradient boosting to big data, using timers to measure its speed limitations. XGBoost’s inner workings are then explored with a focus on speed enhancements and the mathematical derivation of its parameters.

With the help of detailed case studies, you’ll practice building and fine-tuning XGBoost classifiers and regressors using scikit-learn and the original Python API. You’ll leverage XGBoost hyperparameters to improve scores, correct missing values, scale imbalanced datasets, and fine-tune alternative base learners. Finally, you’ll apply advanced techniques such as building non-correlated ensembles, stacking models, and preparing models for industry deployment using sparse matrices, customized transformers, and pipelines.
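
As a taste of this workflow, the following minimal sketch scores an XGBoost classifier with scikit-learn cross-validation; the built-in breast cancer dataset and the hyperparameter values are illustrative stand-ins, not examples from the book:

    # Minimal sketch: an XGBoost classifier scored with scikit-learn
    # cross-validation. The dataset and hyperparameter values are
    # illustrative stand-ins for the book's case studies.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)

    model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
    scores = cross_val_score(model, X, y, cv=5)
    print('Mean accuracy: %.3f' % scores.mean())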

By the end of the book, you’ll be able to build high-performing machine learning models using XGBoost with minimal errors and maximum speed.

What you will learn

  • Build gradient boosting models from scratch
  • Develop XGBoost regressors and classifiers with accuracy and speed
  • Analyze variance and bias when fine-tuning XGBoost hyperparameters
  • Automatically correct missing values and scale imbalanced data
  • Apply alternative base learners like dart, linear models, and XGBoost random forests
  • Customize transformers and pipelines to deploy XGBoost models
  • Build non-correlated ensembles and stack XGBoost models to increase accuracy (see the stacking sketch after this list)
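
As a sketch of the last point, scikit-learn's StackingClassifier can combine an XGBoost model with other learners; the dataset and choice of estimators below are illustrative assumptions, not the book's own case study:

    # Minimal sketch: stacking an XGBoost model with other learners.
    # The dataset and choice of estimators are illustrative assumptions.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)

    stack = StackingClassifier(
        estimators=[('xgb', XGBClassifier(n_estimators=100)),
                    ('rf', RandomForestClassifier(n_estimators=100))],
        final_estimator=LogisticRegression(max_iter=1000))
    print('Mean accuracy: %.3f' % cross_val_score(stack, X, y, cv=5).mean())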

Who this book is for

This book is for data science professionals and enthusiasts, data analysts, and developers who want to build fast and accurate machine learning models that scale with big data. Proficiency in Python, along with a basic understanding of linear algebra, will help you get the most out of this book.

Table of contents

  1. Hands-On Gradient Boosting with XGBoost and scikit-learn
  2. Why subscribe?
  3. Contributors
  4. About the author
  5. Foreword
  6. About the reviewers
  7. Packt is searching for authors like you
  8. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Setting up your coding environment
      1. Anaconda
      2. Using Jupyter notebooks
      3. XGBoost
      4. Versions
      5. Accessing code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Reviews
  9. Section 1: Bagging and Boosting
  10. Chapter 1: Machine Learning Landscape
    1. Previewing XGBoost
      1. What is machine learning?
    2. Data wrangling
      1. Dataset 1 – Bike rentals
      2. Understanding the data
      3. Correcting null values
    3. Predicting regression
      1. Predicting bike rentals
      2. Saving data for future use
      3. Declaring predictor and target columns
      4. Understanding regression
      5. Accessing scikit-learn
      6. Silencing warnings
      7. Modeling linear regression
      8. XGBoost
      9. XGBRegressor
      10. Cross-validation
    4. Predicting classification
      1. What is classification?
      2. Dataset 2 – The census
      3. Data wrangling
      4. Logistic regression
      5. The XGBoost classifier
    5. Summary
  11. Chapter 2: Decision Trees in Depth
    1. Introducing decision trees with XGBoost
    2. Exploring decision trees
      1. First decision tree model
      2. Inside a decision tree
    3. Contrasting variance and bias
    4. Tuning decision tree hyperparameters
      1. Decision Tree regressor
      2. Hyperparameters in general
      3. Putting it all together
    5. Predicting heart disease – a case study
      1. Heart Disease dataset
      2. Decision Tree classifier
      3. Choosing hyperparameters
      4. Narrowing the range
      5. feature_importances_
    6. Summary
  12. Chapter 3: Bagging with Random Forests
    1. Technical requirements
    2. Bagging ensembles
      1. Ensemble methods
      2. Bootstrap aggregation
    3. Exploring random forests
      1. Random forest classifiers
      2. Random forest regressors
    4. Random forest hyperparameters
      1. oob_score
      2. n_estimators
      3. warm_start
      4. bootstrap
      5. Verbose
      6. Decision Tree hyperparameters
    5. Pushing random forest boundaries – case study
      1. Preparing the dataset
      2. n_estimators
      3. cross_val_score
      4. Fine-tuning hyperparameters
      5. Random forest drawbacks
    6. Summary
  13. Chapter 4: From Gradient Boosting to XGBoost
    1. Technical requirements
    2. From bagging to boosting
      1. Introducing AdaBoost
      2. Distinguishing gradient boosting
    3. How gradient boosting works
      1. Residuals
      2. Learning how to build gradient boosting models from scratch
      3. Building a gradient boosting model in scikit-learn
    4. Modifying gradient boosting hyperparameters
      1. learning_rate
      2. Base learner
      3. subsample
      4. RandomizedSearchCV
      5. XGBoost
    5. Approaching big data – gradient boosting versus XGBoost
      1. Introducing the exoplanet dataset
      2. Preprocessing the exoplanet dataset
      3. Building gradient boosting classifiers
      4. Timing models
      5. Comparing speed
    6. Summary
  14. Section 2: XGBoost
  15. Chapter 5: XGBoost Unveiled
    1. Designing XGBoost
      1. Historical narrative
      2. Design features
    2. Analyzing XGBoost parameters
      1. Learning objective
    3. Building XGBoost models
      1. The Iris dataset
      2. The Diabetes dataset
    4. Finding the Higgs boson – case study
      1. Physics background
      2. Kaggle competitions
      3. XGBoost and the Higgs challenge
      4. Data
      5. Scoring
      6. Weights
      7. The model
    5. Summary
  16. Chapter 6: XGBoost Hyperparameters
    1. Technical requirements
    2. Preparing data and base models
      1. The heart disease dataset
      2. XGBClassifier
      3. StratifiedKFold
      4. Baseline model
      5. Combining GridSearchCV and RandomizedSearchCV
    3. Tuning XGBoost hyperparameters
      1. Applying XGBoost hyperparameters
      2. n_estimators
      3. learning_rate
      4. max_depth
      5. gamma
      6. min_child_weight
      7. subsample
      8. colsample_bytree
    4. Applying early stopping
      1. What is early stopping?
      2. eval_set and eval_metric
      3. early_stopping_rounds
    5. Combining hyperparameters
      1. One hyperparameter at a time
      2. Hyperparameter adjustments
    6. Summary
  17. Chapter 7: Discovering Exoplanets with XGBoost
    1. Technical requirements
    2. Searching for exoplanets
      1. Historical background
      2. The Exoplanet dataset
      3. Graphing the data
      4. Preparing data
      5. Initial XGBClassifier
    3. Analyzing the confusion matrix
      1. confusion_matrix
      2. classification_report
      3. Alternative scoring methods
    4. Resampling imbalanced data
      1. Resampling
      2. Undersampling
      3. Oversampling
    5. Tuning and scaling XGBClassifier
      1. Adjusting weights
      2. Tuning XGBClassifier
      3. Consolidating results
      4. Analyzing results
    6. Summary
  18. Section 3: Advanced XGBoost
  19. Chapter 8: XGBoost Alternative Base Learners
    1. Technical requirements
    2. Exploring alternative base learners
      1. gblinear
      2. DART
      3. XGBoost random forests
    3. Applying gblinear
      1. Applying gblinear to the Diabetes dataset
      2. Linear datasets
      3. Analyzing gblinear
    4. Comparing dart
      1. DART with XGBRegressor
      2. dart with XGBClassifier
      3. DART hyperparameters
      4. Modifying dart hyperparameters
      5. Analyzing dart
    5. Finding XGBoost random forests
      1. Random forests as base learners
      2. Random forests as XGBoost models
      3. Analyzing XGBoost random forests
    6. Summary
  20. Chapter 9: XGBoost Kaggle Masters
    1. Technical requirements
    2. Exploring Kaggle competitions
      1. XGBoost in Kaggle competitions
      2. The structure of Kaggle competitions
      3. Hold-out sets
    3. Engineering new columns
      1. What is feature engineering?
      2. Uber and Lyft data
    4. Building non-correlated ensembles
      1. Range of models
      2. Correlation
      3. Correlation in machine learning ensembles
      4. The VotingClassifier ensemble
    5. Stacking models
      1. What is stacking?
      2. Stacking in scikit-learn
    6. Summary
  21. Chapter 10: XGBoost Model Deployment
    1. Technical requirements
    2. Encoding mixed data
      1. Loading data
      2. Clearing null values
      3. One-hot encoding
      4. Combining a one-hot encoded matrix and numerical columns
    3. Customizing scikit-learn transformers
      1. Customizing transformers
      2. Preprocessing pipeline
    4. Finalizing an XGBoost model
      1. First XGBoost model
      2. Fine-tuning the XGBoost hyperparameters
      3. Testing model
    5. Building a machine learning pipeline
    6. Summary
  22. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Hands-On Gradient Boosting with XGBoost and scikit-learn
  • Author(s): Corey Wade
  • Release date: October 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781839218354