The Data Analysis Workshop

Learn how to analyze data using Python models with the help of real-world use cases and guidance from industry experts

Key Features

  • Get to grips with data analysis by studying use cases from different fields
  • Develop your critical thinking skills by following tried-and-true data analysis techniques
  • Learn how to use conclusions from data analyses to make better business decisions

Book Description

Businesses today operate online and generate data almost continuously. Although raw data may not always seem useful, processing and analyzing it correctly can reveal valuable hidden insights. The Data Analysis Workshop will help you discover these hidden patterns in your data, analyze them, and leverage the results to help transform your business.

The book begins by taking you through the use case of a bike rental shop. You'll be shown how to correlate data, plot histograms, and analyze temporal features. As you progress, you'll learn how to plot data for a hydraulic system using the Seaborn and Matplotlib libraries, and explore a variety of use cases that show you how to join and merge databases, prepare data for analysis, and handle imbalanced data.

By the end of the book, you'll have learned different data analysis techniques, including hypothesis testing, correlation, and null-value imputation, and will have become a confident data analyst.
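
To give a rough feel for the three techniques named above, here is a minimal sketch using only the Python standard library. It is not taken from the book, which works with libraries such as Seaborn and Matplotlib. The data values, the helper names (`impute_mean`, `pearson`, `permutation_test`), and the choice of a permutation test as the hypothesis test are all assumptions made for illustration.

```python
import random
import statistics


def impute_mean(values):
    """Null-value imputation: replace None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def permutation_test(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical data: daily temperature (one reading missing) and ride counts.
temps = [10.0, 12.0, None, 18.0, 20.0, 25.0]
rides = [120, 135, 150, 180, 200, 240]

temps_filled = impute_mean(temps)   # the missing reading becomes the mean
r = pearson(temps_filled, rides)    # strength of the linear relationship

# Hypothetical ride counts for two groups of days.
weekday = [200, 210, 190, 205, 215]
weekend = [120, 130, 110, 125, 135]
p_value = permutation_test(weekday, weekend)

print(temps_filled, round(r, 3), round(p_value, 4))
```

A small p-value here suggests that the gap between the two group means is unlikely to come from chance alone. In practice, a library routine such as a t-test would often be used instead of the hand-rolled permutation test shown here.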

What you will learn

  • Get to grips with the fundamental concepts and conventions of data analysis
  • Understand how different algorithms help you to analyze the data effectively
  • Determine the variation between groups of data using hypothesis testing
  • Visualize your data correctly using appropriate plotting techniques
  • Use correlation techniques to uncover the relationship between variables
  • Find hidden patterns in data using advanced techniques and strategies

Who this book is for

The Data Analysis Workshop is for programmers who already know how to code in Python and want to use it to perform data analysis. If you are looking to gain practical experience in data science with Python, this book is for you.

Table of contents

  1. The Data Analysis Workshop
  2. Preface
    1. About the Book
      1. Audience
      2. About the Chapters
      3. Conventions
      4. Code Presentation
      5. Setting up Your Environment
      6. Installation and Setup
      7. Installing Libraries
      8. Accessing the Code Files
  3. 1. Bike Sharing Analysis
    1. Introduction
    2. Understanding the Data
    3. Data Preprocessing
      1. Exercise 1.01: Preprocessing Temporal and Weather Features
      2. Registered versus Casual Use Analysis
      3. Exercise 1.02: Analyzing Seasonal Impact on Rides
    4. Hypothesis Tests
      1. Exercise 1.03: Estimating Average Registered Rides
      2. Exercise 1.04: Hypothesis Testing on Registered Rides
    5. Analysis of Weather-Related Features
      1. Exercise 1.05: Evaluating the Difference between the Pearson and Spearman Correlations
        1. Correlation Matrix Plot
    6. Time Series Analysis
      1. Exercise 1.06: Time Series Decomposition in Trend, Seasonality, and Residual Components
    7. ARIMA Models
      1. Exercise 1.07: ACF and PACF Plots for Registered Rides
      2. Activity 1.01: Investigating the Impact of Weather Conditions on Rides
    8. Summary
  4. 2. Absenteeism at Work
    1. Introduction
    2. Initial Data Analysis
      1. Exercise 2.01: Identifying Reasons for Absence
    3. Initial Analysis of the Reason for Absence
    4. Analysis of Social Drinkers and Smokers
      1. Exercise 2.02: Identifying Reasons of Absence with Higher Probability Among Drinkers and Smokers
      2. Exercise 2.03: Identifying the Probability of Being a Drinker/Smoker, Conditioned to Absence Reason
    5. Body Mass Index
    6. Age and Education Factors
      1. Exercise 2.04: Investigating the Impact of Age on Reason for Absence
      2. Exercise 2.05: Investigating the Impact of Education on Reason for Absence
    7. Transportation Costs and Distance to Work Factors
    8. Temporal Factors
      1. Exercise 2.06: Investigating Absence Hours, Based on the Day of the Week and the Month of the Year
      2. Activity 2.01: Analyzing the Service Time and Son Columns
    9. Summary
  5. 3. Analyzing Bank Marketing Campaign Data
    1. Introduction
    2. Initial Data Analysis
      1. Exercise 3.01: Analyzing Distributions of Numerical Features in the Banking Dataset
      2. Exercise 3.02: Analyzing Distributions of Categorical Features in the Banking Dataset
    3. Impact of Numerical Features on the Outcome
      1. Exercise 3.03: Hypothesis Test of the Difference of Distributions in Numerical Features
    4. Modeling the Relationship via Logistic Regression
    5. Linear Regression
    6. Logistic Regression
      1. Exercise 3.04: Logistic Regression on the Full Marketing Campaign Data
      2. Activity 3.01: Creating a Leaner Logistic Regression Model
    7. Summary
  6. 4. Tackling Company Bankruptcy
    1. Introduction
      1. Explanation of Some of the Important Features
      2. Importing the Data
      3. Exercise 4.01: Importing Data into DataFrames
      4. Pandas Profiling
        1. Running Pandas Profiling
        2. Pandas Profiling Report for DataFrame 1
        3. Pandas Profiling Report for DataFrame 2
    2. Missing Value Analysis
      1. Exercise 4.02: Performing Missing Value Analysis for the DataFrames
    3. Imputation of Missing Values
      1. Mean Imputation
      2. Exercise 4.03: Performing Mean Imputation on the DataFrames
      3. Iterative Imputation
      4. Exercise 4.04: Performing Iterative Imputation on the DataFrame
    4. Splitting the Features
    5. Feature Selection with Lasso
      1. Lasso Regularization for Mean-Imputed DataFrames
      2. Lasso Regularization for Iterative-Imputed DataFrames
      3. Activity 4.01: Feature Selection with Lasso
    6. Summary
  7. 5. Analyzing the Online Shopper's Purchasing Intention
    1. Introduction
      1. Data Dictionary
    2. Importing the Data
    3. Exploratory Data Analysis
      1. Univariate Analysis
        1. Baseline Conversion Rate from the Revenue Column
        2. Visitor-Wise Distribution
        3. Traffic-Wise Distribution
      2. Exercise 5.01: Analyzing the Distribution of Customers Session on the Website
        1. Region-Wise Distribution
      3. Exercise 5.02: Analyzing the Browser and OS Distribution of Customers
        1. Administrative Pageview Distribution
        2. Information Pageview Distribution
        3. Special Day Session Distribution
      4. Bivariate Analysis
        1. Revenue Versus Visitor Type
        2. Revenue Versus Traffic Type
      5. Exercise 5.03: Analyzing the Relationship between Revenue and Other Variables
      6. Linear Relationships
        1. Bounce Rate versus Exit Rate
        2. Page Value versus Bounce Rate
        3. Page Value versus Exit Rate
        4. Impact of Administration Page Views and Administrative Pageview Duration on Revenue
        5. Impact of Information Page Views and Information Pageview Duration on Revenue
    4. Clustering
      1. Method to Find the Optimum Number of Clusters
      2. Exercise 5.04: Performing K-means Clustering for Informational Duration versus Bounce Rate
        1. Performing K-means Clustering for Informational Duration versus Exit Rate
      3. Activity 5.01: Performing K-means Clustering for Administrative Duration versus Bounce Rate and Administrative Duration versus Exit Rate
    5. Summary
  8. 6. Analysis of Credit Card Defaulters
    1. Introduction
    2. Importing the Data
    3. Data Preprocessing
    4. Exploratory Data Analysis
      1. Univariate Analysis
      2. Bivariate Analysis
      3. Exercise 6.01: Evaluating the Relationship between the DEFAULT Column and the EDUCATION and MARRIAGE Columns
        1. PAY_1 versus DEFAULT
        2. Balance versus DEFAULT
      4. Exercise 6.02: Evaluating the Relationship between the AGE and DEFAULT Columns
    5. Correlation
      1. Activity 6.01: Evaluating the Correlation between Columns Using a Heatmap
    6. Building a Profile of a High-Risk Customer
    7. Summary
  9. 7. Analyzing the Heart Disease Dataset
    1. Introduction
      1. Exercise 7.01: Loading and Understanding the Data
      2. Outliers
      3. Exercise 7.02: Checking for Outliers
      4. Activity 7.01: Checking for Outliers
      5. Exercise 7.03: Plotting the Distributions and Relationships Between Specific Features
      6. Activity 7.02: Plotting Distributions and Relationships between Columns with Respect to the Target Column
      7. Exercise 7.04: Plotting the Relationship between the Presence of Heart Disease and Maximum Recorded Heart Rate
      8. Activity 7.03: Plotting the Relationship between the Presence of Heart Disease and the Cholesterol Column
      9. Exercise 7.05: Observing Correlations with a Heatmap
    2. Summary
  10. 8. Analyzing Online Retail II Dataset
    1. Introduction
    2. Data Cleaning
      1. Exercise 8.01: Loading and Cleaning Our Data
    3. Data Preparation and Feature Engineering
      1. Exercise 8.02: Preparing Our Data
    4. Data Analysis
      1. Exercise 8.03: Finding the Answers in Our Data
      2. Activity 8.01: Performing Data Analysis on the Online Retail II Dataset
    5. Summary
  11. 9. Analysis of the Energy Consumed by Appliances
    1. Introduction
      1. Exercise 9.01: Taking a Closer Look at the Dataset
      2. Exercise 9.02: Analyzing the Light Energy Consumption Column
      3. Activity 9.01: Analyzing the Appliances Energy Consumption Column
      4. Exercise 9.03: Performing Feature Engineering
      5. Exercise 9.04: Visualizing the Dataset
      6. Activity 9.02: Observing the Trend between a_energy and day
      7. Exercise 9.05: Plotting Distributions of the Temperature Columns
      8. Activity 9.03: Plotting Distributions of the Humidity Columns
      9. Exercise 9.06: Plotting out_b, out_hum, visibility, and wind
    2. Summary
  12. 10. Analyzing Air Quality
    1. Introduction
    2. About the Dataset
      1. Exercise 10.01: Concatenating Multiple DataFrames and Checking for Missing Values
    3. Outliers
      1. Exercise 10.02: Identifying Outliers
      2. Activity 10.01: Checking for Outliers
    4. Missing Values
      1. Exercise 10.03: Dealing with Missing Values
      2. Exercise 10.04: Observing the Concentration of PM25 and PM10 per Year
      3. Activity 10.02: Observing the Pollutant Concentration per Year
      4. Activity 10.03: Observing Pollutant Concentration per Month
    5. Heatmaps
      1. Exercise 10.05: Checking for Correlations between Features
    6. Summary
  13. Appendix
    1. 1. Bike Sharing Analysis
      1. Activity 1.01: Investigating the Impact of Weather Conditions on Rides
    2. 2. Absenteeism at Work
      1. Activity 2.01: Analyzing the Service Time and Son Columns
    3. 3. Analyzing Bank Marketing Campaign Data
      1. Activity 3.01: Creating a Leaner Logistic Regression Model
    4. 4. Tackling Company Bankruptcy
      1. Activity 4.01: Feature Selection with Lasso
    5. 5. Analyzing the Online Shopper's Purchasing Intention
      1. Activity 5.01: Performing K-means Clustering for Administrative Duration versus Bounce Rate and Administrative Duration versus Exit Rate
    6. 6. Analysis of Credit Card Defaulters
      1. Activity 6.01: Evaluating the Correlation between Columns Using a Heatmap
    7. 7. Analyzing the Heart Disease Dataset
      1. Activity 7.01: Checking for Outliers
      2. Activity 7.02: Plotting Distributions and Relationships between Columns with Respect to the Target Column
      3. Activity 7.03: Plotting the Relationship between the Presence of Heart Disease and the Cholesterol Column
    8. 8. Analyzing Online Retail II Dataset
      1. Activity 8.01: Performing Data Analysis on the Online Retail II Dataset
    9. 9. Analysis of the Energy Consumed by Appliances
      1. Activity 9.01: Analyzing the Appliances Energy Consumption
      2. Activity 9.02: Observing the Trend between a_energy and day
      3. Activity 9.03: Plotting Distributions of the Humidity Columns
    10. 10. Analyzing Air Quality
      1. Activity 10.01: Checking for Outliers
      2. Activity 10.02: Observing the Pollutant Concentration per Year
      3. Activity 10.03: Observing Pollutant Concentration per Month

Product information

  • Title: The Data Analysis Workshop
  • Author(s): Gururajan Govindan, Shubhangi Hora, Konstantin Palagachev
  • Release date: July 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781839211386