Book description
Cut through the noise and get real results with a step-by-step approach to data science
Key Features
- Ideal for the data science beginner who is just getting started
- A data science tutorial with step-by-step exercises and activities that help build key skills
- Structured to let you progress at your own pace, on your own terms
- Use your physical print copy to redeem free access to the online interactive edition
Book Description
You already know you want to learn data science, and a smarter way to learn it is by doing. The Data Science Workshop focuses on building up your practical skills so that you can develop simple machine learning models in Python or even build an advanced model for detecting potential bank fraud with effective modern data science. You'll learn from real examples that lead to real results.
Throughout The Data Science Workshop, you'll take an engaging step-by-step approach to understanding data science. You won't have to sit through any unnecessary theory. If you're short on time, you can jump into a single exercise each day or spend an entire weekend training a model using scikit-learn. It's your choice. Learning on your terms, you'll build up and reinforce key skills in a way that feels rewarding.
Every physical print copy of The Data Science Workshop unlocks access to the interactive edition. With videos detailing all exercises and activities, you'll always have a guided solution. You can also benchmark yourself against assessments, track progress, and receive content updates. You'll even earn a secure credential that you can share and verify online upon completion. It's a premium learning experience that's included with your printed copy. To redeem, follow the instructions located at the start of your data science book.
Fast-paced and direct, The Data Science Workshop is the ideal companion for data science beginners. You'll work with machine learning algorithms the way a data scientist does, learning as you go. This process means that your new skills stick, embedded as best practice, giving you a solid foundation for the years ahead.
What you will learn
- Find out the key differences between supervised and unsupervised learning
- Manipulate and analyze data using the scikit-learn and pandas libraries
- Learn about different algorithms such as regression, classification, and clustering
- Discover advanced techniques to improve model ensembling and accuracy
- Speed up the process of creating new features with an automated feature engineering tool
- Simplify machine learning using open source Python packages
Who this book is for
Our goal at Packt is to help you be successful, in whatever it is you choose to do. The Data Science Workshop is an ideal data science tutorial for the data science beginner who is just getting started. Pick up a Workshop today and let Packt help you develop skills that stick with you for life.
Table of contents
- Preface
- 1. Introduction to Data Science in Python
- 2. Regression
- Introduction
- Simple Linear Regression
- Multiple Linear Regression
- Conducting Regression Analysis Using Python
- Exercise 2.01: Loading and Preparing the Data for Analysis
- The Correlation Coefficient
- Exercise 2.02: Graphical Investigation of Linear Relationships Using Python
- Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python
- The Statsmodels formula API
- Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels formula API
- Analyzing the Model Summary
- The Model Formula Language
- Intercept Handling
- Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels formula API
- Multiple Regression Analysis
- Assumptions of Regression Analysis
- Explaining the Results of Regression Analysis
- Summary
- 3. Binary Classification
- Introduction
- Understanding the Business Context
- Business Discovery
- Exercise 3.01: Loading and Exploring the Data from the Dataset
- Testing Business Hypotheses Using Exploratory Data Analysis
- Visualization for Exploratory Data Analysis
- Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan
- Intuitions from the Exploratory Analysis
- Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits
- Feature Engineering
- Data-Driven Feature Engineering
- Correlation Matrix and Visualization
- Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data
- Skewness of Data
- Histograms
- Density Plots
- Other Feature Engineering Methods
- Summarizing Feature Engineering
- Building a Binary Classification Model Using the Logistic Regression Function
- Logistic Regression Demystified
- Metrics for Evaluating Model Performance
- Confusion Matrix
- Accuracy
- Classification Report
- Data Preprocessing
- Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank
- Activity 3.02: Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables
- Next Steps
- Summary
- 4. Multiclass Classification with RandomForest
- 5. Performing Your First Cluster Analysis
- 6. How to Assess Performance
- Introduction
- Splitting Data
- Assessing Model Performance for Regression Models
- Assessing Model Performance for Classification Models
- The Confusion Matrix
- Exercise 6.06: Generating a Confusion Matrix for the Classification Model
- Precision
- Exercise 6.07: Computing Precision for the Classification Model
- Recall
- Exercise 6.08: Computing Recall for the Classification Model
- F1 Score
- Exercise 6.09: Computing the F1 Score for the Classification Model
- Accuracy
- Exercise 6.10: Computing Model Accuracy for the Classification Model
- Logarithmic Loss
- Exercise 6.11: Computing the Log Loss for the Classification Model
- Receiver Operating Characteristic Curve
- Area Under the ROC Curve
- Saving and Loading Models
- Summary
- 7. The Generalization of Machine Learning Models
- 8. Hyperparameter Tuning
- 9. Interpreting a Machine Learning Model
- 10. Analyzing a Dataset
- 11. Data Preparation
- 12. Feature Engineering
- 13. Imbalanced Datasets
- Introduction
- Understanding the Business Context
- Challenges of Imbalanced Datasets
- Strategies for Dealing with Imbalanced Datasets
- Generating Synthetic Samples
- Implementation of SMOTE and MSMOTE
- Exercise 13.03: Implementing SMOTE on Our Banking Dataset to Find the Optimal Result
- Exercise 13.04: Implementing MSMOTE on Our Banking Dataset to Find the Optimal Result
- Applying Balancing Techniques on a Telecom Dataset
- Activity 13.01: Finding the Best Balancing Technique by Fitting a Classifier on the Telecom Churn Dataset
- Summary
- 14. Dimensionality Reduction
- Introduction
- Creating a High-Dimensional Dataset
- Strategies for Addressing High-Dimensional Datasets
- Backward Feature Elimination (Recursive Feature Elimination)
- Exercise 14.02: Dimensionality Reduction Using Backward Feature Elimination
- Forward Feature Selection
- Exercise 14.03: Dimensionality Reduction Using Forward Feature Selection
- Principal Component Analysis (PCA)
- Exercise 14.04: Dimensionality Reduction Using PCA
- Independent Component Analysis (ICA)
- Exercise 14.05: Dimensionality Reduction Using Independent Component Analysis
- Factor Analysis
- Exercise 14.06: Dimensionality Reduction Using Factor Analysis
- Comparing Different Dimensionality Reduction Techniques
- Summary
- 15. Ensemble Learning
- Introduction
- Ensemble Learning
- Simple Methods for Ensemble Learning
- Averaging
- Exercise 15.02: Ensemble Model Using the Averaging Technique
- Weighted Averaging
- Exercise 15.03: Ensemble Model Using the Weighted Averaging Technique
- Exercise 15.04: Ensemble Model Using Max Voting
- Advanced Techniques for Ensemble Learning
- Exercise 15.05: Ensemble Learning Using Bagging
- Boosting
- Exercise 15.06: Ensemble Learning Using Boosting
- Stacking
- Exercise 15.07: Ensemble Learning Using Stacking
- Activity 15.02: Comparison of Advanced Ensemble Techniques
- Summary
- 16. Machine Learning Pipelines
- Introduction
- Pipelines
- Automating ML Workflows Using Pipeline
- ML Pipeline with Processing and Dimensionality Reduction
- ML Pipeline for Modeling and Prediction
- ML Pipeline for Spot-Checking Multiple Models
- ML Pipelines for Identifying the Best Parameters for a Model
- Applying Pipelines to a Dataset
- Summary
- 17. Automated Feature Engineering
- Introduction
- Feature Engineering
- Automating Feature Engineering Using Feature Tools
- Business Context
- Domain Story for the Problem Statement
- Featuretools – Creating Entities and Relationships
- Exercise 17.01: Defining Entities and Establishing Relationships
- Feature Engineering – Basic Operations
- Featuretools – Automated Feature Engineering
- Exercise 17.02: Creating New Features Using Deep Feature Synthesis
- Exercise 17.03: Classification Model after Automated Feature Generation
- Featuretools on a New Dataset
- Summary
Product information
- Title: The Data Science Workshop
- Author(s):
- Release date: January 2020
- Publisher(s): Packt Publishing
- ISBN: 9781838981266